# Online Learning with Dependent Stochastic Feedback Graphs

Corinna Cortes¹, Giulia De Salvo¹, Claudio Gentile¹, Mehryar Mohri¹, Ningshan Zhang²

*Equal contribution. ¹Google Research, New York, NY. ²Hudson River Trading, New York, NY. Correspondence to: Giulia De Salvo.

Proceedings of the 37th International Conference on Machine Learning, Online, PMLR 119, 2020. Copyright 2020 by the author(s).

Abstract

A general framework for online learning with partial information is one where feedback graphs specify which losses can be observed by the learner. We study a challenging scenario where feedback graphs vary stochastically with time and, more importantly, where graphs and losses are dependent. This scenario appears in several real-world applications that we describe, where the outcomes of actions are correlated. We devise a new algorithm for this setting that exploits the stochastic properties of the graphs and that benefits from favorable regret guarantees. We present a detailed theoretical analysis of this algorithm, and also report the results of a series of experiments on real-world datasets, which show that our algorithm outperforms standard baselines for online learning with feedback graphs.

1. Introduction

Prediction with expert advice is a classical framework for modeling the repeated interactions between a learner and the environment (Littlestone & Warmuth, 1994; Cesa-Bianchi & Lugosi, 2006). In this framework, the learner maintains performance estimates for a set of experts (or actions), based on their past behavior. At each round, she uses those estimates to select an action, which incurs a loss determined by the environment, and subsequently updates her estimates. The learner's objective is to minimize her regret, that is, the difference between her cumulative loss over a finite number of interactions and that of the best expert in hindsight. The two most standard settings in this framework are the full information setting, where the learner is informed of the loss incurred by all actions, and the bandit setting, where she can only observe the loss incurred by the selected action. These settings are both special instances of a general model of online learning introduced by Mannor & Shamir (2011), where loss observability is specified by a feedback graph. In a directed feedback graph, each vertex represents an action, and an edge from vertex $i$ to vertex $j$ indicates that the loss of action $j$ is observed when action $i$ is selected. The bandit setting corresponds to a feedback graph reduced to only self-loops at each vertex; the full information setting corresponds to a fully connected graph. Online learning with feedback graphs has been further extensively analyzed by Alon et al. (2013; 2017) and by a series of other recent publications (Caron et al., 2012; Alon et al., 2015; Kocák et al., 2014; Neu, 2015; Cohen et al., 2016; Buccapatnam et al., 2014b; Wu et al., 2015a; Tossou et al., 2017; Liu et al., 2018; Yun et al., 2018; Cortes et al., 2019; Li et al., 2020). In some of these studies, the feedback graph $G$ is assumed to be fixed over time (Caron et al., 2012; Buccapatnam et al., 2014b; Neu, 2015), while others allow a time-varying feedback graph $G_t$, chosen adversarially at each round $t$ (Alon et al., 2013; 2017; 2015; Kocák et al., 2014; Cohen et al., 2016; Kocák et al., 2016).
This paper considers a scenario commonly appearing in practice where feedback graphs are time-varying but stochastic, that is, at each round $t$, the feedback graph $G_t$ is drawn i.i.d. according to some unknown distribution. Crucially, we will not assume that the losses and feedback graphs are statistically independent, a key assumption adopted by many authors, including Cohen et al. (2016); Tossou et al. (2017); Liu et al. (2018); Li et al. (2020), in their analysis of (partially accessible) feedback graphs. The novelty of the setting presented in this paper is thus not just to study stochastic feedback graphs but, more importantly, to allow for an underlying dependence between the feedback graphs and the losses, which, as we will see, makes the analysis intrinsically harder. We call this new setting online learning with dependent stochastic feedback graphs, highlighting that losses and graphs are dependent stochastic variables.

Our scenario of online learning with dependent stochastic graphs is relevant to many real-world problems. As an example, consider the problem of a doctor selecting an appropriate treatment (action) for a patient, based on his symptoms. The patient can be modeled by a feature vector $x_t$ representing his health history and other characteristics, and we assume a fixed distribution over the patients $x_t$ visiting that doctor. Each $x_t$ induces a feedback graph over treatments, since the success of treatment $i$ for curing a collection of symptoms that the patient displays could also reveal the effectiveness of an alternate treatment $j$. Thus, just as $x_t$, the feedback graph is stochastic, and since both the loss and the feedback graph depend on $x_t$, they are not independent. More generally, the scenario of online learning with stochastic feedback graphs is pertinent to a variety of problems where a stochastic context or feature vector $x_t$ induces a relationship between experts or actions, as the outcome of one expert for $x_t$ also reveals information about that of some other experts. We will discuss in more detail several examples of such structured experts.

Our study admits some connection with that of Kocák et al. (2016), who examine a scenario of online learning with imperfect feedback, modeled by a weighted graph, and with the closely related one of Wu et al. (2015b); however, the definition of the edge weights in those studies is entirely distinct from ours. Related to feedback graph problems, but in settings quite different from ours, Yun et al. (2018) consider stochastic bandits where the extra information admits a cost to be traded off against regret, while Cortes et al. (2019) analyze feedback graphs in the context of sleeping experts.

Our setting naturally raises the following question: how can we leverage the stochastic properties of the graphs while at the same time dealing with the dependency between losses and graphs? One can prove that simply averaging the loss observations of an expert based on the feedback graph $G_t$ may result in an arbitrarily biased empirical estimate of the expected loss of that expert, since there is a dependency between losses and graphs and since we make no assumptions on the nature of this dependency. In order to avoid this pitfall, we introduce an estimated feedback graph $\hat G_t$ based on past observations up to time $t-1$, instead of relying on the feedback graph $G_t$ at time $t$.
The design of our algorithm then relies on the observability specified by $\hat G_t$, so that the empirical estimates defined by averaging loss observations based on the estimated graphs are unbiased. Specifically, the empirical estimate of the performance of action $j$ is updated whenever action $I_t$ is selected by the algorithm and there is an edge from $I_t$ to $j$ in $\hat G_t$. Using estimated graphs, however, leads to another complication: there are rounds $t$ where, even though there is an edge from $I_t$ to $j$ in $\hat G_t$, the loss of action $j$ is not revealed, namely when the graph $G_t$ does not contain an edge from $I_t$ to $j$. In order for these events not to occur too often, we introduce a threshold $\tau_t$ to remove low-probability edges from the estimated graph. That is, depending only on information up to time $t-1$, we estimate the probability of an edge from vertex $i$ to vertex $j$ and define, at each round $t$, the graph $\hat G_t$ whose edges admit a probability above this threshold $\tau_t$. Furthermore, on these events, since the loss of action $j$ is not revealed, we also introduce an estimated loss used to update the empirical estimate of the performance of expert $j$. These estimated losses effectively estimate the expectation of the loss of expert $j$ when there is no edge from $i$ to $j$ in graph $G_t$.

Intuitively, denser feedback graphs imply that more loss observations are revealed and hence that the algorithm should converge faster, since the empirical estimates more quickly converge to their respective means. The novelty of our approach thus lies in carefully leveraging the interplay between the estimated graphs, thresholds, and estimated losses in order to reliably update the empirical estimates of the expected loss of an expert via the densest graph possible, thereby circumventing the dependency of losses and graphs while also exploiting the presence of edges in the feedback graph to attain favorable theoretical guarantees.

We start with a formal description of the scenario of online learning with dependent stochastic feedback graphs and introduce the relevant notation (Section 2). In Section 3, we present a novel algorithm that exploits the stochasticity of the feedback graphs and carefully deals with the dependency between losses and graphs. In Section 4, we prove that our algorithm benefits from favorable pseudo-regret guarantees that are logarithmic in the number of rounds $T$ and are expressed in terms of the expected feedback graph, $\mathbb{E}[G_t]$. In that section, we also illustrate, through a lower bounding argument, that the dependency structure between losses and graphs may render the extra side information delivered by feedback graphs mostly useless. In Section 5, we discuss a natural scenario with stochastic graphs and report the results of several experiments on real-world datasets suggesting that our algorithm outperforms standard baselines.

2. Learning Scenario

We consider the familiar sequential learning problem over $T$ rounds, where the algorithm has access to a set of $K$ experts (or actions) $E = \{\pi_1, \ldots, \pi_K\}$, which we abusively identify with $E = \{1, \ldots, K\}$ in some cases. At each round $t \in [T]$, each expert $i \in E$ is assigned a loss $\ell_t(\pi_i) \in [0, 1]$. The algorithm selects an expert $I_t$ and incurs the corresponding loss $\ell_t(\pi_{I_t})$.
Additionally, at each round $t \in [T]$, the algorithm is supplied with a feedback graph $G_t = (E, E_t)$, where each vertex corresponds to an expert, and where the presence of an edge from vertex $i$ to vertex $j$ indicates that the loss of expert $j$ is revealed to the algorithm if it selects expert $i$ at time $t$. In what follows, we denote by $N_t(i)$ the out-neighborhood of vertex $i$ in $G_t$, that is, the set of vertices $j$ that can be reached from $i$ via an edge in $E_t$. Note that the feedback graph $G_t$ only specifies the observability of expert losses, not the loss values themselves.

We assume, in what follows, that the graphs $G_t$ admit self-loops at all vertices, that is, $i \in N_t(i)$ for all $i \in E$. This guarantees that the information received by the algorithm includes at least the loss value $\ell_t(\pi_{I_t})$ of the action $I_t$ it selects, that is, the information available in the bandit setting.

We consider the setting where expert losses and feedback graphs are stochastic and arbitrarily dependent. Let $\mathcal{G}$ denote the family of all directed graphs with $K$ vertices (and self-loops on all of them), and let $\mathcal{D}$ be a distribution over $\mathcal{G} \times [0, 1]^K$ unknown to the algorithm. At each round $t \in [T]$, the environment generates the pair $(G_t, \ell_t) \in \mathcal{G} \times [0, 1]^K$ i.i.d. according to $\mathcal{D}$, where $\ell_t = (\ell_t(\pi_1), \ldots, \ell_t(\pi_K))$ is the vector of all $K$ expert losses.

In particular, this setting includes the important learning scenario of statistical learning, so far not studied in the bandit literature, where pairs $(x_t, y_t)$ are drawn i.i.d. according to some distribution over $X \times Y$, where $X$ is the input space and $Y$ is the output space. Expert $\pi_i$ is a mapping from $X$ to $Y$, and its loss at time $t$ is given by $\ell_t(\pi_i) = L(\pi_i(x_t), y_t)$, for some loss function $L$. The feedback graph $G_t$ is then typically a function of $x_t$. As an example, consider the following sequence of graphs: at each time $t$, graph $G_t$ admits (bi-directional) edges $(i, j)$ whenever $\pi_i(x_t) = \pi_j(x_t)$. The probability $p_{i,j}$ of an edge in this graph is then given by $p_{i,j} = \mathbb{P}(\{x \colon \pi_i(x) = \pi_j(x)\})$, with $p_{i,i} = 1$ since the graph admits self-loops. Experts may be, for example, decision trees with the same function value in a given region of the space, which implies that these trees admit the same loss in this region, and the graph $G_t$ then accounts for the shared loss observability between these trees whenever $x_t$ falls in this region. In such scenarios, $G_t$ and the losses $\ell_t$ are dependent, as they are both functions of $x_t$. Note that the full vector of losses $\ell_t$ is generated i.i.d. at each time $t$, but its components $\ell_t(\pi_i)$ can be dependent. Likewise, the graphs $G_t$ are i.i.d. across time but, for any given $t$, their edges $(i, j)$ may be correlated.

Another instance of this scenario is learning with abstention, where an expert can elect to either make a prediction or abstain, at a cost typically less than that of misclassification (Cortes et al., 2016; 2018). In the stochastic case, both the loss vector and the feedback graph at round $t$ are functions of $x_t$, with the feedback graph $G_t$ defined as follows: if the expert selected by the algorithm elects to predict on input $x_t$, then the label $y_t$ is revealed, and the loss $\ell_t(\pi_i) = L(\pi_i(x_t), y_t)$ of every expert $i$ is revealed; otherwise, $I_t$ abstains, the label $y_t$ is not revealed, and only the loss of $I_t$, along with that of the other abstaining experts, is revealed, which is some fixed abstention cost $c$. We consider the so-called uninformed case, where graph $G_t$ is revealed to the algorithm only after it has selected an action to play.
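To make the graph-loss dependence concrete, the following minimal simulation sketches the abstention scenario just described. The margin-based abstention rule and all variable names are illustrative assumptions, not taken from the paper:

```python
import numpy as np

# Minimal sketch of the abstention scenario: the loss vector and the feedback
# graph are both functions of the same draw x_t, hence dependent. The
# margin-based abstention rule below is an illustrative assumption.
rng = np.random.default_rng(0)
K, d, c_abstain = 5, 3, 0.3             # experts, input dimension, abstention cost
W = rng.normal(size=(K, d))             # expert i predicts sign(W[i] @ x) ...
margin = rng.uniform(0.2, 1.0, size=K)  # ... and abstains when |W[i] @ x| < margin[i]

def draw_round():
    x = rng.normal(size=d)
    y = 1.0 if x.sum() >= 0 else -1.0   # some ground-truth label
    scores = W @ x
    abstains = np.abs(scores) < margin
    losses = np.where(abstains, c_abstain,
                      (np.sign(scores) != y).astype(float))
    # Feedback graph: if the played expert predicts, y_t is revealed and so is
    # every expert's loss; if it abstains, only the (known) abstention cost of
    # the abstaining experts is observed. Self-loops are always present.
    G = np.zeros((K, K), dtype=bool)
    for i in range(K):
        G[i] = np.ones(K, dtype=bool) if not abstains[i] else abstains
        G[i, i] = True
    return losses, G

losses, G_t = draw_round()  # G_t[i, j]: playing i reveals j's loss
```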
The standard measure of the performance of the algorithm is the pseudo-regret $R_T$, defined as follows:

$$R_T = \mathbb{E}\Bigg[\sum_{t=1}^{T} \ell_t(\pi_{I_t}) - \sum_{t=1}^{T} \ell_t(\pi_{i^\star})\Bigg],$$

where $\pi_{i^\star} \in \operatorname{argmin}_{\pi \in E} \mathbb{E}[\ell(\pi)]$ is the best expert in expectation and the expectation is taken over the i.i.d. sequence of pairs $(G_t, \ell_t)$ drawn from $\mathcal{D}$.

3. Algorithm

In this section, we present our new algorithm UCB-DSG, or UCB with Dependent Stochastic Graphs, for the stochastic and loss-dependent feedback graph scenario just discussed. In Section 4, we prove that this algorithm admits favorable pseudo-regret guarantees. The design of UCB-DSG builds on the classical UCB algorithm of Auer et al. (2002a). The pseudocode of UCB-DSG is given in Algorithm 1.

At a high level, UCB-DSG maintains an empirical estimate $\hat\mu_{i,t}$ of the average loss $\mu_i = \mathbb{E}[\ell(\pi_i)]$ of each expert $i \in E$. At each round, it selects the expert with the smallest lower confidence estimate, and uses the loss observations from a feedback graph, defined below, to update its estimates. Ideally, the algorithm would update its empirical means at round $t$ according to graph $G_t$. However, since $G_t$ may not be a complete graph and may depend on the losses $\ell_t(\pi_i)$, simply taking an average over the losses observed could lead to biased empirical estimates. As a straightforward example, consider a graph $G_t$ that, when selecting expert $i$, also reveals the loss of expert $j$, but only when $\ell_t(\pi_j) > 0.5$. For a detailed discussion of this technical bias problem in the context of online learning with abstention, see (Cortes et al., 2018). Section 4 below contains an illustrative example, in the form of a lower bound, showing that even if the (marginal) probabilities $p_{i,j}$ of edges $(i, j)$ are all close to 1, we cannot in general take advantage of the presence of these edges when graphs and losses are statistically dependent.

The first key idea behind the design of our algorithm is to use at time $t$ a surrogate graph $\hat G_t$ derived from an average of past observed graphs $G_1, \ldots, G_{t-1}$, which bypasses this technical bias issue. However, the loss of an expert $j$ in the out-neighborhood $\hat N_t(I_t)$ of graph $\hat G_t$ may not actually be observed at round $t$. This is because, in general, graphs $G_t$ and $\hat G_t$ do not coincide, and $N_t(I_t)$ may not contain $j$. In order to control the estimation error from such cases, we introduce a threshold that is used to trim the estimated graph of any low-probability edges. Moreover, in these cases, UCB-DSG resorts to an estimated loss $\tilde\ell_t(\pi_j)$ for expert $j$, since the true loss is not revealed. The threshold and the estimated losses are carefully crafted to guarantee that the true mean of each expert is, with high probability, within a confidence interval around the empirical mean. Specifically, there is a critical interplay between the estimated losses and thresholds to favor a denser graph while at the same time reliably updating the empirical estimates $\hat\mu_{i,t}$.

ALGORITHM 1: UCB-DSG
  Parameters: experts $E = \{\pi_1, \ldots, \pi_K\}$, number of rounds $T$.
  Init: $U_{i,0} = Q_{i,0} = 1$ for all $i \in [K]$.
  for $t \in [T]$ do
    $S_{i,t-1} \leftarrow O(\log(Kt))/\sqrt{Q_{i,t-1}}$ for all $i \in [K]$;   {constants in Appendix B}
    $I_t \leftarrow \operatorname{argmin}_{i \in [K]} (\hat\mu_{i,t-1} - S_{i,t-1})$;
    $\hat N_t(I_t) \leftarrow \{j \in [K] \colon \hat p^{t-1}_{I_t,j} > \tau_{j,t-1}\} \cup \{I_t\}$;   {create $\hat G_t$}
    for $j \in [K]$ do
      if $j \in \hat N_t(I_t)$ then
        $Q_{j,t} \leftarrow Q_{j,t-1} + 1$;
        if $j \in N_t(I_t)$ then
          $\hat\mu_{j,t} \leftarrow \ell_t(\pi_j)/Q_{j,t} + (1 - 1/Q_{j,t})\,\hat\mu_{j,t-1}$;   {use true loss}
        else
          $\hat\mu_{j,t} \leftarrow \tilde\ell_t(\pi_j)/Q_{j,t} + (1 - 1/Q_{j,t})\,\hat\mu_{j,t-1}$;   {use estimated loss}
      else
        $Q_{j,t} \leftarrow Q_{j,t-1}$; $S_{j,t} \leftarrow S_{j,t-1}$; $\hat\mu_{j,t} \leftarrow \hat\mu_{j,t-1}$;   {no update}
      if $\hat p^{t-1}_{I_t,j} > 1 - (O(\log(Kt)) + 1)/\sqrt{U_{j,t-1} + 1}$ then
        $U_{j,t} \leftarrow U_{j,t-1} + 1$;
        for $i \in [K]$ do
          $\tilde\ell_{j,i,t} \leftarrow \ell_t(\pi_j)\,\mathbb{I}\{j \notin N_t(i),\, j \in N_t(I_t)\}/U_{j,t} + (1 - 1/U_{j,t})\,\tilde\ell_{j,i,t-1}$;
      else
        $U_{j,t} \leftarrow U_{j,t-1}$; $\tilde\ell_{j,i,t} \leftarrow \tilde\ell_{j,i,t-1}$, $\forall i \in [K]$;   {no update}
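Returning to the bias example above (expert $j$'s loss revealed only when it exceeds $0.5$), the distortion is easy to check numerically; here is a small simulation sketch, where uniform losses and the $0.5$ revelation rule are the only assumptions:

```python
import numpy as np

# Sketch of the bias example: expert j's loss is revealed only on rounds where
# that loss exceeds 0.5, so averaging the revealed losses overestimates the
# true mean E[loss_j] = 0.5.
rng = np.random.default_rng(1)
loss_j = rng.uniform(0.0, 1.0, size=100_000)  # true mean is 0.5
revealed = loss_j > 0.5                       # edge to j present iff loss is large
naive = loss_j[revealed].mean()               # average over revealed rounds only
print(f"true mean ~ {loss_j.mean():.3f}, naive estimate ~ {naive:.3f}")
# prints roughly: true mean ~ 0.500, naive estimate ~ 0.750
```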
To devise our surrogate graph $\hat G_t$, for $t \geq 2$ and for all $i, j \in [K]$, we use an unbiased estimate $\hat p^{t-1}_{i,j}$ of the probability of a directed edge from $i$ to $j$, based on the information available up to time $t-1$:

$$\hat p^{t-1}_{i,j} = \frac{1}{t-1} \sum_{s=1}^{t-1} \mathbb{I}\{j \in N_s(i)\},$$

where $\mathbb{I}\{\cdot\}$ denotes the indicator function of its argument, and $\hat p^{0}_{i,j} = 0$. We define $\hat G_t$ by allowing an edge from $i$ to $j$ only if $\hat p^{t-1}_{i,j}$ is above the aforementioned threshold. That is, the graph $\hat G_t$ is determined by the out-neighborhoods $\{j \in [K] \colon \hat p^{t-1}_{i,j} > \tau_{j,t-1}\} \cup \{i\}$, with the threshold $\tau_{j,t-1}$ specified by

$$\tau_{j,t-1} = \min\Bigg\{\, 1 - \frac{U_{j,t-1} + 1}{Q_{j,t-1} + 1} + \frac{O(\log(Kt))}{\sqrt{Q_{j,t-1} + 1}},\;\; 1 - \frac{2\,O(\log(Kt))}{\sqrt{Q_{j,t-1} + 1}} \,\Bigg\}. \quad (1)$$

In the above, $Q_{j,t-1}$ is the number of times expert $j$ has been updated up to time $t-1$ and, similarly, $U_{j,t-1}$ is the number of times the estimated loss $\tilde\ell_t(\pi_j)$ has been updated up to time $t-1$. The exact constants in front of the log terms hidden in the big-O notation are given in Appendix B. Note that, by construction, $\hat G_t$ admits self-loops at all vertices.

Since the algorithm uses the estimated loss of expert $j$ when the loss of expert $j$ is not revealed, that is, when $j \notin N_t(I_t)$, our estimated loss aims at estimating the following conditional expected loss:

$$\mathbb{E}[\ell_t(\pi_j) \mid j \notin N_t(I_t)] = \frac{\mathbb{E}[\ell_t(\pi_j)\,\mathbb{I}\{j \notin N_t(I_t)\}]}{\mathbb{E}[\mathbb{I}\{j \notin N_t(I_t)\}]}.$$

The denominator in the above fraction can be readily estimated by $1 - \hat p^{t-1}_{I_t,j}$, which converges to its (conditional) mean $1 - p_{I_t,j}$. Concretely, for each pair $(i, j) \in [K] \times [K]$ and time $t \in [T]$, we maintain estimates $\hat p^{t-1}_{i,j}$ and, at time $t$, the probability estimate $\hat p^{t-1}_{i,j}$ corresponding to $i = I_t$ is used. For the numerator $\mathbb{E}[\ell(\pi_j)\,\mathbb{I}\{j \notin N(i)\}]$, on the other hand, we need to introduce a new estimator, and so we define

$$\tilde\ell_{j,i,t} = \frac{1}{U_{j,t}} \sum_{s=1}^{t} \ell_s(\pi_j)\,\mathbb{I}\{j \notin N_s(i)\}\,\mathbb{I}\{j \in N_s(I_s)\}\,C_{j,s}, \qquad C_{j,s} = \mathbb{I}\Bigg\{\hat p^{s-1}_{I_s,j} > 1 - \frac{O(\log(Ks)) + 1}{\sqrt{U_{j,s-1} + 1}}\Bigg\},$$

for all $j$, $i$, and $t$. Note that the sum in the definition of $\tilde\ell_{j,i,t}$ must be over the times $s$ such that $j \in N_s(I_s)$, since the loss of expert $j$ is revealed only when playing action $I_s$. A careful reader will notice that this injects a bias into the estimator $\tilde\ell_{j,i,t}$, but the condition defining $C_{j,s}$, which holds whenever the (estimated) probability of the edge from $I_s$ to $j$ is large enough, is chosen so as to guarantee that $\tilde\ell_{j,i,t}$ converges to $\mathbb{E}[\ell(\pi_j)\,\mathbb{I}\{j \notin N(i)\}]$. One may, in fact, adopt a condition like $C_{j,s}$ to define the estimated graph $\hat G_s$ directly, but this would result in a worse regret guarantee, since $C_{j,s}$ is provably more conservative than the condition $\hat p^{s-1}_{I_s,j} > \tau_{j,s-1}$ used in $\hat G_s$.

Putting the above together, our estimated loss is defined as

$$\tilde\ell_t(\pi_j) = \min\Bigg\{\frac{\tilde\ell_{j,I_t,t-1}}{1 - \hat p^{t-1}_{I_t,j}},\; 1\Bigg\}.$$

Thresholds $\tau_{j,t}$ and estimated losses $\tilde\ell_t(\pi_j)$ together play a key role in the regret guarantee of our algorithm. As we will prove in our analysis, a small threshold $\tau_{j,t}$ (implying that the algorithm will resort to estimated losses more often) results in a more favorable regret bound. The value of $\tau_{j,t}$ in (1) chiefly depends on the interplay between the two quantities $U_{j,t}$ and $Q_{j,t}$. One can show by induction (see Appendix B) that $U_{j,t} \leq Q_{j,t} + 1$ for all $j$ and $t$. Hence, if $U_{j,t}$ is as large as $Q_{j,t}$ (which implies that $\tilde\ell_t(\pi_j)$ is a very accurate estimator of its own expectation), then $\tau_{j,t}$ will be small. On the other hand, when $U_{j,t}$ is much smaller than $Q_{j,t}$, the estimated loss $\tilde\ell_t(\pi_j)$ is a less reliable estimator. In this case, the threshold is $\tau_{j,t} = 1 - 2\,O(\log(Kt))/\sqrt{Q_{j,t} + 1}$, which will be large if action $j$ has already undergone many updates.
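The following sketch implements the three quantities just defined, with every $O(\log(Kt))$ factor set to $\log(Kt)$ for illustration (the paper's exact constants are in Appendix B; all function names are ours):

```python
import numpy as np

# Sketch of hat p, tau, and the estimated loss, with all O(log(Kt)) factors
# set to log(Kt) for illustration; the exact constants are in Appendix B.

def edge_prob_estimate(past_graphs, K):
    """hat p^{t-1}: entrywise frequency of edge i -> j over G_1, ..., G_{t-1}."""
    if not past_graphs:
        return np.zeros((K, K))
    return np.mean(np.stack(past_graphs).astype(float), axis=0)

def threshold(U, Q, K, t):
    """tau_{j,t-1} as in Eq. (1); small when U_j is close to Q_j."""
    slack = np.log(K * t) / np.sqrt(Q + 1.0)
    return np.minimum(1.0 - (U + 1.0) / (Q + 1.0) + slack,
                      1.0 - 2.0 * slack)

def estimated_loss(num_estimate, p_hat_edge):
    """tilde ell_t(pi_j) = min(tilde ell_{j,I_t,t-1} / (1 - hat p), 1)."""
    return min(num_estimate / max(1.0 - p_hat_edge, 1e-12), 1.0)
```

The `min` in `threshold` is what favors a denser surrogate graph precisely when the estimated losses are reliable.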
In particular, if the algorithm has already resorted to $\tilde\ell_t(\pi_j)$ very often (and $\tilde\ell_t(\pi_j)$ is not dependable because of a small $U_{j,t}$), a large $\tau_{j,t}$ will ensure that the algorithm relies on $\tilde\ell_t(\pi_j)$ less frequently in the future. Finally, $\tau_{j,t}$ also mildly decreases with time, due to its logarithmic dependence on $t$ through $O(\log(Kt))$.

In summary, at each round $t \in [T]$, UCB-DSG selects the expert with the smallest optimistic empirical loss, that is, $I_t = \operatorname{argmin}_{i \in [K]} \hat\mu_{i,t-1} - S_{i,t-1}$, and operates with the surrogate graph $\hat G_t$ to update all estimates $\hat\mu_{j,t}$, for $j \in \hat N_t(I_t)$, by either using the true loss $\ell_t(\pi_j)$, when available ($j \in N_t(I_t)$), or an estimated loss $\tilde\ell_t(\pi_j)$ otherwise. Notice that, since the definition of $\hat G_t$ only depends on graphs $G_1, \ldots, G_{t-1}$ and not on $G_t$, Algorithm 1 need not receive graph $G_t$ before playing action $I_t$ (uninformed setting).

4. Theoretical Guarantees

In this section, we analyze the UCB-DSG algorithm just described and prove that it benefits from favorable logarithmic pseudo-regret guarantees in terms of the independent sets of certain graphs derived from the matrix of probabilities $[p_{i,j}]_{i,j=1}^{K}$. Our regret upper bound is given below; a proof sketch is provided in Section 4.1 and detailed in Appendix B. In Section 4.2, we complement our upper bound with a lower bound showing that, in the case of strong dependency between graphs and losses, even when $p_{i,j}$ is close to 1 for all $i$ and $j$, we cannot in general achieve a regret better than one scaling linearly with $K$, as in the bandit setting.

As in standard UCB pseudo-regret bounds, our guarantees are expressed in terms of the loss gaps $\Delta_j = \mu_j - \mu_{i^\star}$, $j \in [K]$, where $\mu_j$ is the expected loss of expert $j$ and $\mu_{i^\star} = \min_{j \in [K]} \mu_j$ that of the best expert $i^\star$. Naturally, our results are also expressed in terms of graph properties. For any $i, j \in [K]$, let $p_{i,j}$ denote the marginal probability of an edge from $i$ to $j$. We define $G$ as the weighted directed graph that admits an edge from $i$ to $j$ with weight $p_{i,j}$, for any $i, j \in [K]$. $G$ can then be viewed as the process graph from which the graphs $G_t$ are drawn i.i.d., or their expectation $\mathbb{E}[G_t]$. For any vector of thresholds $\tau = (\tau_1, \ldots, \tau_K) \in [0, 1]^K$, we define $G_\tau = (E, E_\tau)$ as the unweighted and undirected graph derived from $G$ by keeping only those edges in $G$ that satisfy both $p_{i,j} > \tau_j$ and $p_{j,i} > \tau_i$. In our bound, we only consider strictly positive loss gaps, $\Delta_j > 0$, and omit this condition to avoid notational burden. Recalling that an independent set $I(G)$ of an undirected graph $G$ is a set of vertices such that no two vertices in $I(G)$ are adjacent, we denote by $\mathcal{I}(G) = \{I_1(G), I_2(G), \ldots\}$ the set of all independent sets of graph $G$. The bound in the theorem that follows is in terms of the independent sets of $G_\tau$, for suitably chosen $\tau$.

Theorem 1. Define

$$K^\star = \Big\{ j \in [K] \colon p_{i^\star,j} > 1 - \frac{\Delta_j}{O(\log(KT))} \Big\}$$

and, for $\gamma \in [0, 1)$, let $\tau(\gamma) = (\tau_1(\gamma), \ldots, \tau_K(\gamma))$ be defined by

$$\tau_i(\gamma) = \begin{cases} 1 - \gamma & \text{for } i \in K^\star, \\ 1 - \frac{O(\log(KT))}{\sqrt{T}} & \text{for } i \notin K^\star. \end{cases}$$

The pseudo-regret $R_T$ of UCB-DSG over $T$ rounds is bounded by

$$R_T \leq \min_{\gamma \in [0,1)} \Bigg\{ \max_{I \in \mathcal{I}(G_{\tau(\gamma)})} \sum_{j \in I} O\bigg(\frac{\ln(KT)\log(T)}{\Delta_j}\bigg) + \frac{f(D, K)}{1 - (1+\gamma)^2/4} \Bigg\}.$$

In the above, $f(D, K)$ is a function linear in the number of experts $K$ and inversely proportional to each gap $\Delta \in D = \{\Delta_j \colon j \in [K], j \neq i^\star\}$, but independent of $T$.

The denser the graph $G_{\tau(\gamma)}$, the smaller the number of independent sets, and thus the more favorable the bound. Recalling that the independence number of a graph is the size of its maximum independent set, the number of terms in the above sum over any independent set $I$ is upper bounded by the independence number $\alpha(\gamma)$ of $G_{\tau(\gamma)}$.
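As an illustration of these graph quantities, the sketch below builds a thresholded graph from a toy marginal probability matrix and computes its independence number by brute force (fine for a small $K$, since this quantity appears only in the analysis, not in the algorithm; the numbers are made up, loosely mimicking Figure 1):

```python
import itertools
import numpy as np

# Sketch: thresholded process graph G_{tau(gamma)} and its independence number.
# P and tau below are made-up toy values.

def thresholded_graph(P, tau):
    """Keep undirected edge {i, j} iff P[i, j] > tau[j] and P[j, i] > tau[i]."""
    K = P.shape[0]
    A = np.zeros((K, K), dtype=bool)
    for i in range(K):
        for j in range(i + 1, K):
            if P[i, j] > tau[j] and P[j, i] > tau[i]:
                A[i, j] = A[j, i] = True
    return A

def independence_number(A):
    K = A.shape[0]
    for r in range(K, 1, -1):  # largest vertex subset with no internal edge
        for S in itertools.combinations(range(K), r):
            if not any(A[a, b] for a, b in itertools.combinations(S, 2)):
                return r
    return 1                   # a single vertex is always independent

P = np.array([[1.0, 0.9, 0.9, 0.1],   # experts {0, 1, 2} are well connected,
              [0.9, 1.0, 0.8, 0.1],   # expert 3 only weakly so
              [0.9, 0.8, 1.0, 0.2],
              [0.1, 0.1, 0.2, 1.0]])
tau = np.array([0.5, 0.5, 0.5, 0.95])  # per-expert incoming-edge threshold
print(independence_number(thresholded_graph(P, tau)))  # -> 2, as in Figure 1
```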
In the case where $\Delta_j = \Delta$ for all $j \neq i^\star$, the above sum can thus be bounded by $\alpha(\gamma)\,\ln(KT)\log(T)/\Delta$. Moreover, $\gamma \in [0, 1)$ is a free parameter that allows for a limited trade-off between the two terms in the bound: as $\gamma$ increases, the number of independent sets decreases while the second term $f(D, K)/(1 - (1+\gamma)^2/4)$ increases, and vice versa.

Figure 1 contains an example of the thresholded process graph $G_{\tau(\gamma)}$ defined in the theorem. Due to how the thresholds $\tau_i(\gamma)$ are defined, more edges connecting nodes within $K^\star$ are included in the graph $G_{\tau(\gamma)}$, as compared to the others, thereby implying that the larger the set $K^\star$, the better the resulting regret bound. More concretely, if $T$ is large and $\gamma$ is a constant (say, $\gamma = 1/2$), then $1 - \gamma \leq 1 - O(\log(KT))/\sqrt{T}$. Thus, edges connecting actions within $K^\star$ are more likely to be included in the thresholded graph $G_{\tau(\gamma)}$ than edges crossing $K^\star$ or connecting nodes outside $K^\star$, so that $K^\star$ is more likely to form a clique in $G_{\tau(\gamma)}$. In this latter case, $\alpha(\gamma) \leq K - |K^\star| + 1$ and hence, the bigger $K^\star$, the more favorable the regret bound.

[Figure 1. Example of the thresholded process graph $G_{\tau(\gamma)}$. Experts $i^\star$, 2, and 3 are in the set $K^\star$, as the probabilities of their incoming edges from $i^\star$ are big enough; $i^\star \in K^\star$ because of its self-loop. Edges between experts $i$ and $j$ in $K^\star$ are included in $G_{\tau(\gamma)}$ if both $p_{i,j}$ and $p_{j,i}$ exceed $1 - \gamma$. Moreover, since typically $1 - O(\log(KT))/\sqrt{T} > 1 - \gamma$, any edge connected to expert $4 \notin K^\star$ must admit a probability greater than $1 - O(\log(KT))/\sqrt{T}$ to be in the graph $G_{\tau(\gamma)}$. Thus, more edges connecting experts within $K^\star$ are included in graph $G_{\tau(\gamma)}$. The independence number of this graph is $\alpha(\gamma) = 2$.]

Notice that the size of $K^\star$ is also influenced by the gaps $\Delta_j$. In particular, for a given matrix $[p_{i,j}]_{i,j=1}^{K}$, the smaller the gaps $\Delta_j$ (i.e., the harder the bandit problem), the smaller $K^\star$: for an expert $j$ with a small gap $\Delta_j$, the corresponding edge weight $p_{i^\star,j}$ must be larger for $j$ to be included in $K^\star$. Said differently, if the best expert $i^\star$ is highly connected to experts with large expected losses, the pseudo-regret bound is favorable even if the connecting edge weights are comparatively small.

The standard UCB bound corresponds to a sum over all experts and, since $G$ admits self-loops at all vertices with weight one, our guarantee is always at least as favorable. In the special case where the probability matrix $[p_{i,j}]_{i,j=1}^{K}$ is binary and symmetric, the graphs $G_t$ are deterministic and all coincide with $G$. In this case, our bound coincides with that of Lykouris et al. (2020, Theorem 3.2), which holds for a UCB-type algorithm, called UCB-N, that averages all available loss observations. In Appendix B (see Theorem 3 therein), we also prove a regret guarantee in terms of the clique partition number of $G_{\tau(\gamma)}$ which, in the case of deterministic graphs, matches the clique-based bound for UCB-N in Caron et al. (2012, Theorem 2). Even though summing over the nodes of an independent set yields a smaller quantity than summing over a clique cover, the clique-based bound shaves off a $\log(T)$ factor compared to that of Theorem 1.

In the setting of fixed undirected graphs, Buccapatnam et al. (2014a) present an interesting action elimination algorithm that solves at every round a linear program based on the feedback graph. The authors prove favorable guarantees in terms of the minimum dominating set of the graph.
Unfortunately, several difficulties arise when adapting (Buccapatnam et al., 2014a) to our setting of dependent stochastic graphs. From a computational standpoint, we would have to solve a linear program at every round based on the feedback graph's edge probabilities $\hat p^{t-1}_{i,j}$. Moreover, we would need to know the time horizon $T$ in advance (a key step of our analysis of UCB-DSG heavily depends on this property). One might instead estimate the $p_{i,j}$'s in an initial stage where no actions are eliminated and then, say, adopt the strategy in (Li et al., 2020), which works in the case where the stochastic graph is independent of the losses and the $p_{i,j}$ are known. However, even this solution looks problematic in our setting. This is because the duration of this initial stage should be at least $\sqrt{T}$ in order to ensure accurate estimation and, if we do not rely on feedback graphs during this stage, we would in any case suffer a regret of the form $K \log T$. More importantly, observe that even if we knew the marginal probabilities $p_{i,j}$ exactly, we would still have to face the loss bias issue intrinsic to our dependency assumptions.

4.1. Sketch of the Pseudo-regret Analysis

Here, we sketch the analysis that proves Theorem 1; see Appendix B for the details. At a high level, the proof proceeds in two steps: 1) we prove that $\hat\mu_{i,t}$ converges to its mean at a rate of $\tilde O(1/\sqrt{Q_{i,t}})$ and then bound the regret using a UCB-type analysis; 2) we relate the resulting bound to the independent sets of the process graph $G_{\tau(\gamma)}$.

For step 1), we first quantify the accuracy of our loss estimates $\tilde\ell_t(\pi_j)$ (Lemmas 1 and 2 in Appendix B). Specifically, Lemma 1 shows that, even though $\tilde\ell_{i,j,t}$ is a biased estimator, it still concentrates around $\mathbb{E}[\ell(\pi_i)\,\mathbb{I}\{i \notin N(j)\}]$. The key idea is to leverage the fact that the majority of observations used in this estimator are revealed through edges with large probability, thereby implying that this estimator mostly includes reliable observations. Then, Lemma 2 proves that, conditioned on the past, $\tilde\ell_t(\pi_i)$ converges to its expectation at a rate of $\tilde O(1/\sqrt{U_{i,t}})$. By using the definition of the threshold $\tau_{i,t-1}$ in conjunction with these two lemmas, Proposition 1 in Appendix B bounds the error term $\varepsilon_{i,t}$ incurred by resorting to estimated losses when constructing the estimator $\hat\mu_{i,t}$:

$$\varepsilon_{i,t} = \frac{1}{Q_{i,t}} \sum_{s=1}^{t} \big(\ell_s(\pi_i) - \tilde\ell_s(\pi_i)\big)\,\mathbb{I}\big\{i \in \hat N_s(I_s),\, i \notin N_s(I_s)\big\}.$$

Proposition 1 states that $|\varepsilon_{i,t}| \leq \tilde O(1/\sqrt{Q_{i,t}})$. The interplay between the threshold and the estimated loss is crucial in the proof of this proposition; in particular, it is this interplay that determines the threshold's key component $U_{i,t-1}/Q_{i,t-1}$. This implies that $\hat\mu_{i,t} + \varepsilon_{i,t}$ is an unbiased estimate that concentrates around its mean $\mu_i$, and then a UCB-type analysis shows that an expert $i \neq i^\star$ cannot be pulled too often.

For step 2), we connect the sequence of graphs $\hat G_1, \ldots, \hat G_t$ generated by the algorithm to the thresholded process graph $G_{\tau(\gamma)}$. Proposition 2 shows that the out-neighborhoods of $G_{\tau(\gamma)}$ are contained in the out-neighborhoods of the graph $\hat G_t$. In contrast to the thresholds $\tau_{i,t}$ generated by the algorithm, the thresholds $\tau_i(\gamma)$ applied to the process graph in Theorem 1 are both algorithm and time independent. To find such thresholds, we leverage a high-probability statement on the number of times an expert is pulled (Lemma 3), and use several properties of $U_{i,t}$ and $Q_{i,t}$, including that $U_{i,t} \leq Q_{i,t} + 1$ (Lemma 4), thereby connecting such thresholds to the loss gaps $\Delta_j$.
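As a complement to step 1), one can verify the key identity behind Proposition 1 directly from the update rule of Algorithm 1 (a short check consistent with the definitions above, ignoring the initialization terms; the rigorous statement is in Appendix B):

$$\hat\mu_{i,t} = \frac{1}{Q_{i,t}} \sum_{s \leq t} \Big( \ell_s(\pi_i)\,\mathbb{I}\{i \in \hat N_s(I_s),\, i \in N_s(I_s)\} + \tilde\ell_s(\pi_i)\,\mathbb{I}\{i \in \hat N_s(I_s),\, i \notin N_s(I_s)\} \Big),$$

so that adding $\varepsilon_{i,t}$ cancels every estimated-loss term:

$$\hat\mu_{i,t} + \varepsilon_{i,t} = \frac{1}{Q_{i,t}} \sum_{s \leq t} \ell_s(\pi_i)\,\mathbb{I}\{i \in \hat N_s(I_s)\}.$$

Since membership in $\hat N_s(I_s)$ is determined by rounds $1, \ldots, s-1$, while $\ell_s(\pi_i)$ is a fresh i.i.d. draw, each summand has conditional mean $\mu_i$, and the average concentrates around $\mu_i$ via a martingale version of Hoeffding's inequality (cf. van de Geer, 2002).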
Lastly, to complete step 2), following ideas in (Lykouris et al., 2020), we show in Proposition 3 that the regret bound depends on the independent sets of the graph $G_{\tau(\gamma)}$.

4.2. Lower Bound

We now show that the intrinsic difficulty of learning with dependent stochastic feedback graphs does not derive from the stochasticity of the graphs (or the lack of prior knowledge of the probabilities $p_{i,j}$) but, rather, from the arbitrariness of the dependency structure of graphs and losses.

Theorem 2. For any number of arms $K \geq 2$, any gap value $\Delta \in (0, 1/4)$, any edge probability $p \in [0, 1)$, and any strongly consistent policy,¹ there exists a dependent stochastic feedback graph problem with process graph $G$ with $p_{i,j} = p$ for all $i \neq j$, $p_{i,i} = 1$ for all $i$, and expected losses $\mu_1, \ldots, \mu_K$ with $\mu_i - \mu_{i^\star} = (1-p)\Delta$ for all $i \neq i^\star$, such that

$$R_T = \Omega\bigg(\frac{K}{\Delta}\,\log T\bigg).$$

¹A strongly consistent policy is one such that $R_T = o(T^a)$ for all $a \in (0, 1)$.

In the above lower bound, $p$ can be arbitrarily close to 1, but is assumed to be constant (independent of $\Delta$, $K$, and $T$). Notice that this lower bound holds even if the algorithm knows $p$. Yet, this result does not contradict the upper bound of Theorem 1, since that theorem refers to the independence number of an undirected graph $G_{\tau(\gamma)}$ that retains enough edges $(i, j)$ only when $|K^\star|$ is big, which in turn happens only if the probabilities $p_{i^\star,j}$ defining $K^\star$ increase to 1 as $T \to \infty$, and since $f(D, K)$ increases as the gaps decrease.

The lower bound also illustrates the stark difference between dependent and independent feedback graph problems. For instance, if $G_t$ is an undirected Erdős–Rényi stochastic graph with probability $p$ of independently generating each edge, then the gap-dependent pseudo-regret upper bound is of the form $\frac{\log K}{p}\log T$, i.e., logarithmic in $K$ rather than linear. This conclusion can be extracted from both Buccapatnam et al. (2014a, Remark 5) and Li et al. (2020, Theorem 3), and essentially follows from the fact that the independence number of an Erdős–Rényi graph with $K$ nodes and edge probability $p$ is $(2/p)\log K$ (Frieze, 1990).

5. Experiments

In this section, we present several experiments testing the UCB-DSG algorithm on multiclass classification problems. We focus on a regime where the number of rounds, $T = 10{,}000$, is relatively small compared to the number of experts, $K = 1{,}000$. This is a regime that fully illustrates the need to leverage the side information provided by feedback graphs, and where we expect vanilla UCB (one of our baselines) to exhibit slower convergence.

Consider the standard multiclass setting with $c$ classes and the 0/1 loss $L(\pi(x), y) = \mathbb{I}\{\pi(x) \neq y\}$. Each expert $\pi_i$, for $i \in [K]$, consists of $c$ hyperplanes $[w^1_i, \ldots, w^c_i]$, each drawn randomly from a Gaussian $N(0, 1)^d$, where $d$ is the input dimension. Expert $\pi_i$ labels input $x$ according to the standard scoring function $\pi_i(x) = \operatorname{argmax}_{y \in [c]} w^y_i \cdot x$. We define the graphs $G_t$ as follows: $G_t$ admits a (bi-directional) edge from $i$ to $j$ if and only if $\pi_i(x_t) = \pi_j(x_t)$. If two experts predict the same class, then they admit the same loss and hence, their loss observability is precisely captured by the graph $G_t$. See Appendix A for further illustration.

We first compared UCB-DSG to the vanilla UCB algorithm, which only receives the loss of the chosen expert, and to the Fully Supervised (FS) algorithm, an unrealistic baseline violating our scenario, which at every round chooses the expert with the smallest empirical loss and receives the loss of every expert.
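For concreteness, the following sketch reproduces this experimental construction, i.e., random linear experts and the agreement graph (dimensions and names are illustrative, not the paper's code):

```python
import numpy as np

# Sketch of the experimental setup: K random linear experts over c classes,
# 0/1 loss, and feedback graph G_t joining experts that predict the same
# class on x_t. Dimensions and names are illustrative.
rng = np.random.default_rng(0)
K, c, d = 1000, 10, 16
W = rng.normal(size=(K, c, d))          # expert i holds hyperplanes w^1_i..w^c_i

def predict_all(x):
    """pi_i(x) = argmax_y w^y_i . x for every expert i; returns shape (K,)."""
    return np.argmax(W @ x, axis=1)

def feedback_graph(preds):
    """Bi-directional edge (i, j) iff pi_i(x_t) == pi_j(x_t); has self-loops."""
    return preds[:, None] == preds[None, :]

x_t, y_t = rng.normal(size=d), 3        # one (x_t, y_t) draw; y_t illustrative
preds = predict_all(x_t)
losses = (preds != y_t).astype(float)   # 0/1 loss of every expert
G_t = feedback_graph(preds)             # experts sharing an edge share a loss
```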
We present results for the CIFAR dataset, as well as ten datasets from the UCI repository: letter, pendigits, poker, satimage, shuttle, segment, covtype, acoustic, HIGGS, and sensorless (see Appendix D for dataset statistics). The features of each of the UCI datasets were scaled to $[-1, 1]$. For the CIFAR dataset, we extracted via PCA the first 25 components and used them as features.

The experiment was set up as follows. We randomly drew four sets of hyperplanes $\{[w^1_i, \ldots, w^c_i] \colon i \in [K]\}$ and, for each set of hyperplanes, we randomly shuffled the data six times. Our results are averages over these 24 runs. For all algorithms, since the constants in front of the log terms are artifacts of the analysis, we introduced a parameter $\lambda \in \{0.1, 0.5, 1, 10, 100\}$ and tuned each algorithm over this parameter, as is standard practice. Specifically, $\lambda$ multiplies the slack terms in the confidence intervals. For UCB-DSG, we tune the algorithm over two $\lambda$s: one for the slack $S_{i,t}$ and threshold $\tau_{i,t}$, as they both contain terms of the form $1/\sqrt{Q_{i,t}}$, and one for the estimated loss conditions $C_{i,t}$, as they contain a term of the form $1/\sqrt{U_{i,t}}$. For each algorithm, we ran the above experiment for the different values of $\lambda$ and report the results for the value of $\lambda$ that admits the smallest regret at the last round.

Figure 2 (top row) shows the averaged regret results for some of the datasets, and Appendix D contains the plots for all datasets. These figures show that UCB-DSG outperforms UCB on all eleven datasets except for two, on which it admits a comparable performance. UCB-DSG attains a performance that, in a few cases, is even close to FS.

[Figure 2. Top row: average regret $R_t/t$ as a function of $t$ (log scale) for UCB, UCB-DSG, and FS. Bottom row: average regret $R_t/t$ as a function of $t$ (log scale) for ALG-1, ALG-2, and UCB-DSG. Each column refers to a different dataset, ordered from the left as: pendigits, shuttle, segment, satimage, and covtype.]

We then compared UCB-DSG against two other baselines that we designed based on the concepts analyzed in this paper, ALG-1 and ALG-2. Both algorithms pick the expert with the smallest lower confidence bound, $I_t = \operatorname{argmin}_{i \in [K]} \hat\mu_{i,t-1} - S_{i,t-1}$, but they differ in how they update the experts' empirical estimates. The first baseline, ALG-1, operates as in the full information setting and does not rely on feedback graphs: it updates the chosen expert using its true loss, and uses $\ell'_t(\pi_i) = L(\pi_i(x_t), \pi_{I_t}(x_t))$ to update every other expert $i \neq I_t$. Note that, whenever $I_t$ admits zero loss, $\ell'_t(\pi_i) = \ell_t(\pi_i)$ for all $i$; however, ALG-1 could suffer a potentially linear regret, as the algorithm may inject noisy labels at each round that corrupt the empirical estimates. ALG-2 updates any expert that predicts the same way as $I_t$ by using the true loss; that is, it updates all $i \in [K]$ satisfying $\pi_i(x_t) = \pi_{I_t}(x_t)$ with the loss $L(\pi_{I_t}(x_t), y_t)$. This algorithm is the natural extension of UCB-N of Caron et al. (2012) to this setting but, again, it may suffer linear regret, since it relies on biased empirical estimates. To see why, suppose that ALG-2 selects $\pi_1$, and that $\pi_1(x) = \pi_2(x)$ only on those $x$ such that $\pi_1(x) = \pi_2(x) = m$, for some $m \in [c]$. Then the empirical estimate for $\pi_2$ may be biased, because it is averaged only over rounds $t$ where $\pi_2(x_t) = m$. Although neither of these two algorithms is theoretically motivated, they are natural baselines.
In our experiments (see the bottom row of Figure 2, as well as Appendix D), we found that UCB-DSG in fact outperforms both baselines on all datasets except for three: letter, HIGGS, and CIFAR. On the letter and CIFAR datasets, UCB, ALG-1, ALG-2, and UCB-DSG all admit a similar performance, which seems to suggest that feedback graphs are not beneficial for these two datasets. On the HIGGS dataset, the ALG-1, ALG-2, and UCB-DSG algorithms admit a performance close to FS, thereby indicating that, given the current experimental setup, there is no opportunity for improvement on this dataset. In Appendix D, we further discuss how the density of the graphs affects the algorithms' performance. Altogether, these experiments show that, in a natural scenario where feedback graphs and losses are dependent, our algorithm UCB-DSG achieves a performance that is substantially more favorable than that of readily available baselines.

6. Conclusion

We initiated the analysis of online learning with stochastic feedback graphs in a setting, often emerging in applications, where graphs and losses are statistically dependent. Our algorithm estimates the probability of an edge via past realizations of the stochastic graphs and, based on carefully chosen thresholds, uses estimated losses to update its empirical estimates. Our pseudo-regret bound is in terms of thresholded distributional properties of the process generating the feedback graphs. In this setting, our algorithm benefits from strong theoretical guarantees that become more favorable in cases where playing the best action tends to reveal more losses of the other actions. We have complemented our upper bound with a regret lower bound that illustrates the inherent hardness of the general dependent feedback graph setting. Finally, we reported a series of experiments showing the favorable empirical performance of UCB-DSG, as compared to the UCB algorithm, as well as to readily available UCB-based variants of UCB-DSG that propagate information across actions in a way not supported by our theory. It can be shown that, under some assumptions, our analysis can be extended to the more general setting where feedback graphs may not admit self-loops at all vertices. This is an important setting that naturally includes active learning scenarios, as discussed in Appendix C.

References

Alon, N., Cesa-Bianchi, N., Gentile, C., and Mansour, Y. From bandits to experts: A tale of domination and independence. In NIPS, 2013.

Alon, N., Cesa-Bianchi, N., Dekel, O., and Koren, T. Online learning with feedback graphs: Beyond bandits. In COLT (JMLR Workshop and Conference Proceedings), pp. 23–35, 2015.

Alon, N., Cesa-Bianchi, N., Gentile, C., Mannor, S., Mansour, Y., and Shamir, O. Nonstochastic multi-armed bandits with graph-structured feedback. SIAM J. Comput., 46(6):1785–1826, 2017.

Audibert, J.-Y., Munos, R., and Szepesvári, C. Exploration-exploitation trade-off using variance estimates in multi-armed bandits. Theoretical Computer Science, 2009.

Auer, P., Cesa-Bianchi, N., and Fischer, P. Finite-time analysis of the multiarmed bandit problem. Machine Learning, 47(2-3):235–256, 2002a.

Auer, P., Cesa-Bianchi, N., and Gentile, C. Adaptive and self-confident on-line learning algorithms. Journal of Computer and System Sciences, 64, 2002b.

Buccapatnam, S., Eryilmaz, A., and Shroff, N. Stochastic bandits with side observations on networks. In SIGMETRICS, 2014a.

Buccapatnam, S., Eryilmaz, A., and Shroff, N. B. Stochastic bandits with side observations on networks.
In The 2014 ACM International Conference on Measurement and Modeling of Computer Systems, SIGMETRICS '14, pp. 289–300. ACM, 2014b.

Burnetas, A. and Katehakis, M. Optimal adaptive policies for sequential allocation problems. Advances in Applied Mathematics, 17(2):122–142, 1996.

Caron, S., Kveton, B., Lelarge, M., and Bhagat, S. Leveraging side observations in stochastic bandits. In UAI, 2012.

Cesa-Bianchi, N. and Lugosi, G. Prediction, Learning, and Games. Cambridge University Press, New York, NY, USA, 2006.

Cohen, A., Hazan, T., and Koren, T. Online learning with feedback graphs without the graphs. In ICML, 2016.

Cortes, C., De Salvo, G., and Mohri, M. Learning with rejection. In ALT, 2016.

Cortes, C., De Salvo, G., Gentile, C., Mohri, M., and Yang, S. Online learning with abstention. In ICML, 2018.

Cortes, C., De Salvo, G., Mohri, M., Gentile, C., and Yang, S. Online learning with sleeping experts and feedback graphs. In ICML, 2019.

Frieze, A. M. On the independence number of random graphs. Discrete Mathematics, 81:171–175, 1990.

Kocák, T., Neu, G., Valko, M., and Munos, R. Efficient learning by implicit exploration in bandit problems with side observations. In NIPS, pp. 613–621, 2014.

Kocák, T., Neu, G., and Valko, M. Online learning with noisy side observations. In AISTATS, 2016.

Lai, T. and Robbins, H. Asymptotically efficient adaptive allocation rules. Advances in Applied Mathematics, 6(1):4–22, 1985.

Li, S., Chen, W., and Leung, K. Stochastic online learning with probabilistic graph feedback. In 34th AAAI Conference on Artificial Intelligence, 2020.

Littlestone, N. and Warmuth, M. K. The weighted majority algorithm. Information and Computation, 108(2):212–261, 1994.

Liu, F., Buccapatnam, S., and Shroff, N. Information directed sampling for stochastic bandits with graph feedback. In 32nd AAAI Conference on Artificial Intelligence, 2018.

Lykouris, T., Tardos, E., and Wali, D. Feedback graph regret bounds for Thompson sampling and UCB. In ALT, 2020.

Mannor, S. and Shamir, O. From bandits to experts: On the value of side-observations. In NIPS, pp. 291–307, 2011.

Neu, G. Explore no more: Improved high-probability regret bounds for non-stochastic bandits. In NIPS, pp. 3168–3176, 2015.

Salomon, A., Audibert, J.-Y., and El Alaoui, I. Lower bounds and selectivity of weak-consistent policies in stochastic multi-armed bandit problem. Journal of Machine Learning Research, 14:187–207, 2013.

Tossou, A., Dimitrakakis, C., and Dubhashi, D. Thompson sampling for stochastic bandits with graph feedback. In 31st AAAI Conference on Artificial Intelligence, 2017.

van de Geer, S. On Hoeffding's inequality for dependent random variables. In Dehling, H., Mikosch, T., and Sørensen, M. (eds.), Empirical Process Techniques for Dependent Data. Birkhäuser, 2002.

Wu, Y., György, A., and Szepesvári, C. Online learning with Gaussian payoffs and side observations. In Advances in Neural Information Processing Systems 28, pp. 1360–1368. Curran Associates, Inc., 2015a.

Wu, Y., György, A., and Szepesvári, C. Online learning with Gaussian payoffs and side observations. In NIPS, pp. 1360–1368, 2015b.

Yun, D., Proutiere, A., Ahn, S., Shin, J., and Yi, Y. Multi-armed bandit with additional observations. Proc. ACM Meas. Anal. Comput. Syst., 2(1):13:1–13:22, 2018.