# cluster_agnostic_network_lasso_bandits__d1cafafb.pdf

Published in Transactions on Machine Learning Research (12/2025)

Cluster Agnostic Network Lasso Bandits

Sofien Dhouib sofiane.dhouib@tu-darmstadt.de Department of Computer Science, Technical University of Darmstadt Faculty of Science, University of Tübingen

Steven Bilaj steven.bilaj@ruhr-uni-bochum.de Faculty of Science, University of Tübingen Faculty of Electrical Engineering and Information Technology, Ruhr University Bochum

Behzad Nourani-Koliji behzad.nourani-koliji@uni-tuebingen.de Faculty of Science, University of Tübingen

Setareh Maghsudi setareh.maghsudi@ruhr-uni-bochum.de Faculty of Electrical Engineering and Information Technology, Ruhr University Bochum

Reviewed on Open Review: https: // openreview. net/ forum? id= Qj Ayo MP1DD

We consider a multi-task contextual bandit setting, where the learner is given a graph encoding relations between the bandit tasks. The tasks preference vectors are assumed to be piecewise constant over the graph, forming clusters. At every round, we estimate the preference vectors by solving an online network lasso problem with a suitably chosen, time-dependent regularization parameter. We establish a novel oracle inequality relying on a convenient restricted eigenvalue assumption. Our theoretical findings highlight the importance of dense intra-cluster connections and sparse inter-cluster ones. That results in a sublinear regret bound significantly lower than its counterpart in the independent task learning setting. Finally, we support our theoretical findings by experimental evaluation against graph bandit multi-task learning and online clustering of bandits algorithms.

1 Introduction

Online commercial websites aim to recommend their products to their customers properly, and the performance of these recommendations depends on the knowledge of users preferences. Unlike traditional collaborative-filtering-based methods (Su & Khoshgoftaar, 2009), such knowledge is initially unavailable. Therefore, the online recommender systems need to recommend various items to the users and observe their ratings to explore their preferences. At the same time, the recommender system should be able to recommend items that attract users attention and receive high ratings by exploiting the learned knowledge. The contextual bandit frameworks (Li et al., 2010) have been popularly used to formalize and address this exploration-exploitation trade-off.

However, the classical form of contextual bandits (Li et al., 2010; Chu et al., 2011; Abbasi-Yadkori et al., 2011) ignores the availability of social networks amongst users and solves the problem for each user separately. Consequently, such algorithms have some drawbacks when applied to problems with a large number of users. First, such a large number hinders their computational efficiency. Second, the partial feedback of the bandit settings exposes the algorithms to having weak estimations and impairing their decision-making ability (Yang et al., 2020). Consequently, to improve bandit algorithms performance for large-scale applications, structural assumptions that link the different users are usually integrated within bandit algorithms (Cesa-Bianchi et al., 2013; Gentile et al., 2014; Li et al., 2019; Herbster et al., 2021).

Equal contribution

Published in Transactions on Machine Learning Research (12/2025)

Cesa-Bianchi et al. (2013) and Yang et al. (2020) use the prior knowledge of social networks into their contextual bandit algorithms. Both papers propose UCB-style algorithms and exhibit the importance of using the social network graph to achieve lower regrets using Laplacian regularization. The latter regularization promotes smoothness among the preference vectors of users, allowing the transfer of the collected information between them. However, the Laplacian regularization does not account for the smoothness heterogeneity introduced by a piecewise constant behavior over the graph (Wang et al., 2016). On the other hand, algorithms of online clustering of bandits (Gentile et al., 2014; Li et al., 2019) tackle such a piecewise constant behavior by explicitly estimating user clusters.

In this paper, we assume access to a graph encoding relations between bandit tasks, and that the task parameter vectors are piecewise constant over the graph. We propose an algorithm that integrates the prior knowledge of the piecewise constant structure to update tasks rather than finding the clusters explicitly. That way, we mitigate the limitations mentioned above: the piecewise constant smoothness is naturally integrated into our regularizer, and we do not estimate the clusters so our algorithm does not suffer from overconfidence drawbacks.

More precisely, we provide the following contributions

We analyze an instance of the Network Lasso problem (Hallac et al., 2015), estimating every vertex s preference vector using data generated during the interaction between users and the bandit. We provide the first oracle inequality in this setting and link it to fundamental quantities characterizing the relation between the graph and the true preference vectors of the users. Our result relies on our novel restricted eigenvalue (RE) condition, which we assume for our setting. This result is of independent interest and can be applied to i.i.d. data as a special case.

We prove that the empirical multi-task Gram matrix of the data inherits the RE condition from its true counterpart. Both this result and the previous one depend on the sparsity of inter-cluster connections and the density of intra-cluster ones.

We provide a regret upper bound for our setting. Our bound highlights the advantage of our algorithm in high dimensional settings, and for large graphs.

We support our theoretical findings by extensive numerical experiments on simulated data that prove the advantage of our algorithm over other related approaches.

The rest of the paper is organized as follows. Section 2 discusses the relation of our work to the literature. We formulate our problem and state some of our assumptions in Section 3, then present our bandit algorithm in Section 4. We analyze the problem theoretically in Section 5 and demonstrate its practical interest experimentally in Section 6.

2 Related work

Lasso contextual bandits. To address the high dimensional setting for linear bandits, several multiarmed bandit papers solve a LASSO (Tibshirani, 1996) problem under different assumptions (Bastani & Bayati, 2019; Kim & Paik, 2019; Oh et al., 2021; Ariu et al., 2022). They all rely on a previously established compatibility or RE condition (Bühlmann & van de Geer, 2011), that they adapt to the non-i.i.d case resulting from the context selection procedure across rounds. Such assumptions were also used in the multitask setting by Cella & Pontil (2021) with a Group Lasso regularization (Yuan & Lin, 2006), and to impose a low-rank structure on the task preference vectors in Cella et al. (2023). In our case, we establish a novel oracle inequality, rather than only generalize an existing one to the non-i.i.d setting, with a newly introduced RE assumption, which can be of independent interest.

Clustering of bandits. Gentile et al. (2014) introduced sequential clustering of bandits with the CLUB algorithm. The latter starts with a fully connected graph, and then an iterative graph learning process is performed, where edges between users are deleted if their preference vectors are significantly different. As a

Published in Transactions on Machine Learning Research (12/2025)

result, any connected component is seen as a cluster and only one recommendation per cluster is developed. The SCLUB algorithm of Li et al. (2019) generalizes CLUB via including merging operations in addition to splitting. In contrast to these approaches, Nguyen & Lauw (2014) groups users via K-means clustering, and Cheng et al. (2023) rely on hedonic games for online clustering of bandits. Furthermore, Yang & Toni (2018) make use of community detection techniques on graphs to find user clusters. Gentile et al. (2017) study the clustering of the contextual bandit problem where their proposed algorithm, named CAB, adaptively matches user preferences in the face of constantly evolving items. Our work fundamentally differs from the previous ones on two aspects. First, we assume access to a graph encoding relations between users, which is more informative than a complete graph. Second, we do not keep track of a model for each cluster, but rather we integrate a prior over the graph via a graph total variation regularizer that enforces a piecewise constant behavior for the estimated preference vectors.

Multi-task learning. Several contributions assume that the bandit tasks share some underlying structure. In Cella & Pontil (2021), task preference vectors are assumed to be sparse and to share their sparsity support, implying that they lie in a low-dimensional subspace with dimensions aligning with the canonical basis vectors. This idea is further generalized in Cella et al. (2023), where the tasks are assumed to be confined to an arbitrary unknown low-dimensional subspace. That work improves upon Hu et al. (2021) by not requiring the knowledge of the small dimension of the task space. It can be considered to solve our problem if the number of clusters is smaller than the dimension, resulting in a low-rank structure. However, our work does not rely on any assumption between the number of clusters and the dimension. The underlying structure linking tasks can also be a graph encoding relations between them (Cesa-Bianchi et al., 2013; Yang & Toni, 2018), which is our case. However, while they assume smoothness as a prior, we assume piecewise constant behavior.

Homophily and modularity in social networks Given the large number of users on social networks, one may be able to learn their preferences more quickly by leveraging the similarities between them. This idea relies on the notion of homophily in social networks (Mc Pherson et al., 2001; Easley et al., 2010). In modelling social networks, users preferences relationships are encoded in a graph, where neighboring nodes are users with similar preferences. This graph can be known a priori or it can be inferred from previously collected feedback (Dong et al., 2019). Exploiting this information and integrating them into bandit algorithms can lead to a significant increase in performance Yang et al. (2020). Indeed, the knowledge of user relations allows the algorithm to tackle the data sparsity issue that is inherent to bandit settings. Another fundamental point that can be used to integrate information from social networks is that, social networks show large modularity measures (Newman, 2006; Borge-Holthoefer et al., 2011). This implies that we have high density of edges within clusters and low density of edges between clusters. As a result, users can be clustered based on the graph topology and a preference vector can be learned for each cluster, substantially reducing the dimensionality of the problem. In other words, discovering the clustering structure of users can reduce the computational burden of large social networks. Consequently, there have been attempts in exploiting the clustered structures of social networks in bandit algorithms (Gentile et al., 2014; Nguyen & Lauw, 2014; Yang & Toni, 2018; Li et al., 2019; Nourani-Koliji et al., 2023; Cheng et al., 2023).

Bandit meta-learning In contrast to the multi-task setting, meta learning deals with sequentially arriving tasks that have to be learnt and generalizing the gained information to improve performance for future tasks. Here, as in the multi-task setting, it is assumed that the tasks share some common structure that is ought to be learnt and exploited. Bilaj et al. (2024) assume that the tasks are sampled from a common distribution and concentrated around an affine subspace learned through PCA algorithm. The resulting projection matrices could then be exploited to improve learning for new tasks in an adapted UCB and Thompson sampling approach. Other lines of work are Cella et al. (2020); Kveton et al. (2021); Basu et al. (2021), which learns the mean of the distribution under the assumption that the covariance of the prior is known or Peleg et al. (2022) which generalizes this assumption and attempts to learn the covariance as well.

Published in Transactions on Machine Learning Research (12/2025)

3 Problem setting

We consider a linear bandit setting, with a finite number of tasks representing users in a recommendation system for example. For each task the agent has to choose among K arms, each associated to a d-dimensional context vector. All interactions over a horizon of T time steps. We further assume that we have access to an undirected graph G = (V, E), with vertex set V representing the tasks and edge set E encoding the relationships between them. We identify the vertex set V with the set of vertex indices [|V|]. Thus, we consider E to be a subset of V2, where every edge (m, n) E has weight wmn > 0, with m < n. The tasks preference vectors are denoted by {θm}m V Rd verifying θm 1 m V, which we concatenate as row vectors into matrix Θ R|V| d. The latter represents a graph vector signal, assumed to be piecewise constant over G.

At a round t N , a user m(t) V is selected uniformly at random and served an arm with context vector x(t) from a finite action set A(t) Rd with size K, depending on their estimated preference vector ˆθm(t)(t) Rd. We assume the expected reward to be linear, with an additive σ-sub-Gaussian noise conditionally on the past. Formally, denoting by F0 the trivial sigma-algebra, and for all t 1, by Ft the sigma-algebra generated by history set {m(1), x(1), y(1), , m(t), x(t), y(t), m(t + 1)}, the received reward y(t) is given by y(t) = θm(t)(t), x(t) + η(t), where η(t) is Ft measurable and t 1, s R,

E [η(t)|Ft 1] = 0,

E [exp(sη(t))|Ft 1] exp 1

2σ2s2 . (1)

The performance of our policy is assessed by the expected regret over the T interaction rounds for all tasks:

t=1 max x A(t)

θm(t), x θm(t), x(t) #

The Optimization problem in Equation (4) is an instance of the Network Lasso (Hallac et al., 2015). Several instances of the same type were studied by Jung et al. (2018); Jung & Vesselinova (2019); Jung (2020); He et al. (2019). The objective is characterized by its second term which, while being just the Laplacian regularization without squaring the norms, promotes a piecewise constant behavior rather than smoothness. For real-valued signals (d = 1), this regularization has been extensively studied for image and graph signal denoising, for the problem of trend filtering on graphs (Wang et al., 2016). According to Wang et al. (2016), that regularization better adapts to the heterogeneity of smoothness of the signal and induces a cluster structure in the data: similar users will not only have similar models but the same model, which offers a compression of the overall model over the graph. Note that our setting is cluster agnostic; our algorithm does not aim to learn the cluster structure explicitly but to exploit it implicitly using the total variation semi-norm as regularization. The strength of the latter is controlled via a time-dependent regularization coefficient α(t), which we will express later in the analysis.

We formalize our assumption on the context generation as follows.

Assumption 1 (i.i.d action sets). Context sets {A(t)}T t=1 are generated i.i.d. from a distribution p over RK d, such that x 1 x A(t) t 1.

In addition to the i.i.d assumption, we assume more regularity as follows.

Assumption 2 (Relaxed symmetry and balanced covariance). There exists a constant ν 1 such that for all X RK d, p( X) νp(X). Furthermore, there exists ω > 0, such that for any permutation (a1, , a K) of [K], for any i {2, , K 1}, w Rd, we have

E xaix ai1[w xa1 < < w xa K] ωE (xa1x a1 + xa Kx a K)1[w xa1 < < w xa K] ,

where M N means that N M is a PSD matrix.

Published in Transactions on Machine Learning Research (12/2025)

This assumption was introduced in Oh et al. (2021), and has already been used in a multi-task setting by Cella et al. (2023). Parameter ν controls the skewness, as ν = 1 corresponds to a symmetric distribution. ω decreases with increasing positive correlation between arms. It verifies ω = O(1) for multi-variate Gaussians and uniform distributions over the unit sphere (Oh et al., 2021). The piecewise constant behavior of the graph signal Θ is formalized in the next assumption. Assumption 3 (Piecewise constant signal). There exists a partition P of V, such that for any cluster C P, signal Θ is constant on C, that is, θm = θn for all m, n C. The graph obtained by taking the vertices in C and the edges linking them is connected.

Assumption 3 basically states that the true preference vectors are clustered and that the given graph induces the cluster structure. It is required for our approach to be beneficial, as we will detail in the analysis section. For the sake of clarity, we defer the statement of other technical assumptions to Section 5.

4 Algorithm

Our policy in Algorithm 1 follows a greedy arm selection rule in a multi-task setting, in the same vein as those presented in Oh et al. (2021); Cella et al. (2023). Indeed, as pointed out in Oh et al. (2021), exploration is implicitly incorporated into regularization parameter α(t) s time dependence. It has the following expression

α(t) := α0σ

t + α1(t) + α2(t), (3)

m V |Tm(t)|2 log 1 δ(t),

α2(t) := 2 max m V |Tm(t)| log 1 δ(t),

where α0 > 0. The set of time steps a task m has been selected up to time t is denoted by Tm(t). At the end of a round t, all preference vectors are updated into a new estimation ˆΘ(t) while leveraging the structure of graph G, formally by solving the following network lasso optimization problem:

ˆΘ(t) = arg min Θ R|V| d 1 2t

θm(τ), x(τ) y(τ) 2 + α(t) X

(m,n) E wmn θm θn , (4)

where denotes the Euclidean norm for vectors. At each time step the network Lasso problem is solved via the primal-dual algorithm (Jung, 2020).

Algorithm 1: Network Lasso Policy Input: T, α0 > 0, G, δ Initialization: ˆΘ(0) = 0 R|V| d

for t {1, ..., T} do

Draw a user m(t) V uniformly at random Observe context set A(t)

Select x(t) arg max x A(t) D ˆθm(t 1), x E

Receive payoff y(t) Update α(t) via equation 3 Update ˆΘ(t) via equation 4

This section provides the main steps of the analysis. One of the paper s contribution lies in finding an oracle inequality of the network lasso problem given a restricted eigenvalue condition holding for the true multitask Gram matrix. In this regard, the next major challenge and contribution is to show that the empirical

Published in Transactions on Machine Learning Research (12/2025)

multi-task Gram matrix, estimated in the algorithm, satisfies the restricted eigenvalue condition. We start by proving an oracle inequality for the estimation error of Θ. Then, we prove that the latter assumption holds with high probability given that the true multi-task Gram matrix satisfies it. We end this section by establishing a regret bound for our algorithm.

5.1 Notation and technical assumptions

We provide additional notations required for the analysis. We denote by P the set of all edges in E connecting vertices from different clusters from partition P (Assumption 3), and we call it the boundary of P. Thus, Pc, the complementary set of P, is formed by edges connecting vertices of the same cluster. The total weight of the boundary, i.e.the sum of its edges weights, is referred to as w( P). Given a signal Z R|V| d, we denote by ZP the signal obtained by setting row vectors of Z to their meanper-cluster value w.r.t. P. For any edge subset I E, we denote the following norms: F as the Frobenius norm and Θ I := P

(m,n) I wmn θm θn as the total variation semi-norm of Θ R|V| d over I. Thus, the regularization term of Problem Equation (4) is equal to Θ E. Also, we define the incidence matrix BI R|E| |V|restricted to I E to be null except at rows with index i I corresponding to edge (m, n), where it equals wmn(em en), where em is the mth canonical basis vector of R|V|. We define AV(t) := diag X1(t) X1(t), . . . , X|V|(t) X|V|(t) Rd|V| d|V|, and subsequently the empirical multi-task Gram matrix up to time step t is given by 1

t AV(t). The following definition introduces quantities related to the clusters defined by partition P, with crucial roles that we will elucidate throughout the analysis.

Definition 1 (Cluster content constants). Let C P be a cluster.

We denote by v C the vertices of C that are connected to its complementary. We define the inner isoperimetric ratio of C as ιG(C) := | v C|

By abuse of notation, we denote as BC the incidence matrix restricted to edges linking vertices of C, its associated Laplacian matrix by LC := B C BC, and its pseudo-inverse by L C. The topological centrality index of node m C w.r.t C is equal to (L C) 1 mm. We define the topological centrality index of C by c G(C) := minm C(L C) 1 mm.

The inner isoperimetric ratio of a cluster measures how many interior nodes a cluster contains, in the sense that they are not connected to its complementary. It is at most equal to the isoperimetric ratio for weightless graphs as the size of the inner boundary is at most equal to that of the edge boundary, the latter being connected to the algebraic connectivity via the Cheeger inequality (Cheeger, 1970).

The topological centrality index measures the overall connectedness of a vertex in a network and indicates how robust a node is to edge failures (Ranjan & Zhang, 2013). Also, it can be tied to electricity spreading in a network according to Van Mieghem et al. (2017). We refer the interested reader to the two previously mentioned works for a detailed account of the properties of the topological centrality index. In the appendix, we show that for binary weights graphs the minimum topological centrality index is at least equal to the algebraic connectivity theoretically and experimentally, where we showcase that the difference between the two can be significant.

Remark 1. Both the topological centrality index and inner isoperimetric ratio are key parameters of the cluster structure and the graph. They determine the quality of the given graph. An optimal graph and cluster structure yield many intra-cluster connections and few inter cluster connections i.e. a high topological centrality index and low inner isoperimetric ratio for any cluster. This will later be highlighted in the oracle inequality and the regret bound.

To proceed, we will need the following definition that introduces several notations to reduce the clutter.

Definition 2 (Restricted Eigenvalue (RE) condition and norm). A PSD matrix M Rd|V| d|V| verifies the RE condition with constants κ 1, ψ > 0 and ϕ > 0 if

ϕ2 Z 2 RE vec(Z ) M vec(Z ) Z S, (5)

Published in Transactions on Machine Learning Research (12/2025)

where S is the cone defined by:

S := Z R|V| d; a1

G, Θ, 1 ψw( P)

G, Θ, 1 ψw( P)

a1(G, Θ, α0) := 1

1 α0 + 2κw( P)

c G(C) , a2(G, Θ, α0) := 1

2κw( P) max C P

and the RE semi-norm is defined by Z RE := ZP F .

For the rest of the paper, when we use a1 and a2 without arguments, we set α0 = 1 w( P)ψ in order to reduce clutter. For our main results, we cover the case of κ 1 but treat the more general case κ > 0 in the proofs in the supplementary material. For such a simplification to be valid, we need to assume that the graph satisfies min C P

c G(C) > 2w( P), such that we can always find constants κ 1 and ψ > 0 that guarantee

To explain the RE condition, if we had S = R|V| d and RE = F , then M would be invertible with minimum eigenvalue at least ϕ2. In comparison, our requirement is weaker since it only needs to hold for signals Z S and for the RE semi-norm. It has the same form as the compatibility assumption for the Lasso problem in (Bühlmann & van de Geer, 2011; Oh et al., 2021) or the restricted strong convexity assumption (Cella et al., 2023).

We further make the following assumption on the true multi-task Gram matrix:

Assumption 4 (RE condition for the true multi-task Gram matrix). For k [K], let Σk := E xkx k be the Gram matrix of the kth context vector s marginal distribution, let ΣV be the true multi-task Gram matrix of the context vector generating distribution, given by

ΣV := I|V| Σ, where Σ = 1

k=1 Σk. (6)

We assume that ΣV verifies RE condition (Definition 2) with some problem dependent constants κ 1, 1 2w( P) min C P

c G(C) , ψ 0, 1 w( P) min C P

c G(C) 2 and ϕ > 0.

This assumption is common to several Lasso-like bandit problems (Oh et al., 2021; Ariu et al., 2022; Cella et al., 2023).

We will later show that it can be transferred to the empirical multi-task Gram matrix.

Remark 2. The previous assumption implies the invertibility of the true mean gram matrix of any cluster. For the special case that all clusters are known, i.e. the graph has no inter cluster connections, we can fully exploit this with α , which amounts to the task learning of |P| instead of |V| independent OLS problems, since all users within a cluster share observed context vectors and can be viewed as a singular user. In order for the individual OLS problems to be solvable the true mean gram matrix per cluster needs to be invertible.

We provide further intuition on the constant ϕ within the RE condition. We can show that ϕ has an upper bound:

Proposition 1 (On the RE constant ϕ). Let Mi Rd d be the true multi-task gram matrix of user i. Assume κ 1. Then the constant ϕ of the RE condition can be upper bounded as:

where λmin( ) yields the minimum eigenvalue of a given matrix.

Published in Transactions on Machine Learning Research (12/2025)

Since the true multi-task gram matrix per cluster is always invertible, we always have a non-null minimal eigenvalue.

Remark 3. The minimal eigenvalue in Proposition 1 could be further bounded using the trace of the covariances i.e. the sum of all the eigenvalues over the dimension. This would result into an upper bound of ϕ2 1

5.2 Oracle inequality

This section is dedicated to provide a bound on the estimation error of the Network Lasso problem given in Equation (4) at a particular step t of Algorithm 1.We assume fixed design, meaning that the context vectors are given and fixed, and we are not concerned by their randomness (due to the context generating distribution), nor by the randomness of their number for each user (due to random selection at each time step).

For a time step t, we deliver the oracle inequality controlling the deviation between the estimated preference vectors ˆΘ(t) and the true ones Θ.

Theorem 1 (Oracle inequality). Assume that the RE assumption holds for the empirical multi-task Gram

t AV(t) with constants κ 1, 1 2w( P) min C P

c G(C) , ψ 0, 1 w( P) min C P

c G(C) 2 and ϕ > 0.

Suppose that maxm V |Tm(t)| bt for some b > 0 and α0 1 ψw( P). Then, with a probability at least 1 δ(t), we have

Θ ˆΘ(t) F 2 σα0

tf(G, Θ, α0)

v u u t1 + 2b

|V| log 1 δ(t) + 2b log 1 δ(t),

f(G, Θ, α0) := a2(G, Θ, α0)

a2(G, Θ, α0)

a1(G, Θ, α0) min C P

The proof relies on decomposing the estimation error signal into a sum of two terms. The first term amounts to taking its mean per cluster, that is, every node within the same cluster is mapped to the mean estimation error of its cluster. The second term is proven to be related to the incidence matrices of each cluster. The probabilistic statement comes from a high probability bound on the Euclidean norm of an empirical vector process associated with our problem, using a generalization of the Hanson-Wright inequality to the subgaussian case (Hsu et al., 2012, Theorem 2.1). Compared to the bound of Jung (2020, Theorem 1), we bound a norm of the estimation error rather than just the total variation semi-norm. Besides, due to the expressions of a1(Θ, G, α0) and a2(Θ, G, α0), the bound significantly decreases with the products w( P) min C P p

ι(C) and w( P) max C P c G(C) 1

2 , which are small enough for dense intra-cluster edge links and sparse inter-cluster ones. The bound on the oracle inequality clearly grows with κ and ψ, thus it is most beneficial if κ is close to 1 and ψ close to zero.

5.3 RE condition for the empirical multi-task Gram matrix

To establish the oracle inequality, we assumed that the RE condition holds for the empirical multi-task Gram matrix. In this section, we prove that this holds with high probability. To this end, we use the same strategy as in Oh et al. (2021); Cella et al. (2023). We prove that on the one hand, the empirical multi-task Gram matrix inherits the RE condition from its adapted counterpart since it concentrates around it. On the other hand, we show that the adapted Gram matrix verifies the RE condition due to Assumption 1, 2 and 4.

Theorem 2 (RE condition holding for the empirical multi-task Gram matrix). Under assumptions 2 and 4, let t 1, and let κ, ϕ be the constants from Assumption 4. Assume that maxm V |Tm(t)| bt. Then, for any

γ 0, 1 + a2

2 , the empirical multi-task Gram matrix 1

t AV(t) verifies the RE condition with constants

Published in Transactions on Machine Learning Research (12/2025)

κ, ψ and ˆϕ, where

with a probability at least equal to 1 6d|V| exp

3γ2 ϕ4(min C P( c G(C) c G(C)2)t

, where ϕ := ϕ

c G(C) := c G(C) |C| C P.

The proof follows a similar approach as in Oh et al. (2021); Cella et al. (2023); we prove that the RE condition transfers from the true multi-task Gram matrix to its adapted counterpart VV(t), defined as follows:

VV(t) = diag V1(t), , V|V|(t) , (8)

where Vm(t) = 1

τ Tm(t) E x(τ)x(τ) |Fτ 1 . (9)

This transfer relies on the work of Oh et al. (2021, lemma 10). The other step of the proof is showing that the empirical multi-task Gram matrix and VV(t) become close to each other with high probability after sufficiently many time steps, in the sense of a matrix norm induced by the RE semi-norm and the restriction to set S (Definition 2). The bound showcases a dependence on min C P c G(C) |C|, which is of the same order as |C| for a fully connected cluster with vertices C. It is also clear that the probability of satisfying the RE condition increases with a higher minimum centrality of a cluster.

5.4 Regret bound

To bound the regret, we bound the expected instantaneous regret for each round t 1. This bound relies on the oracle inequality holding and the RE condition being satisfied for the empirical Gram matrix, both with high probability. Thanks to Theorem 1 and Theorem 2, these two conditions are ensured.

Theorem 3. Let the mean horizon per node be T = T |V|. Under assumptions 1 to 4, the expected regret of the Network Lasso Bandit algorithm is upper bounded as follows:

R(T) O α0νωf(G, Θ, α0)

A log(d|V|) + p

A = 3γ2 min C P( c G(C) c2 G(C))

Our regret is mainly formed of two parts. The first one is the sublinear time-dependent term and represents the bulk of horizon dependence. Interestingly, it decreases as the topological centrality index grows with the graph size, which proves the importance of intra-cluster high connectivity.

The second significant term comes from ensuring the RE condition for the empirical multi-task Gram matrix, and can be interpreted as the number of time steps necessary for it to hold, as pointed out by Oh et al. (2021). It has a logarithmic dependence in the graph size and in the dimension, which is a characteristic of regret bound of the "lasso type". Also noteworthy is that the regret grows explicitly with log(d) only in the time-independent term, making our policy useful in high-dimensional settings. Though from Proposition 1 we can expect an implicit dependency on the dimension in the RE constant ϕ. Specifically, the lower bound on ϕ is an open problem that appears unsolved in other lasso based works such as Oh et al. (2021); Cella et al. (2023).

Both the regret bound and the oracle inequality presented in Theorem 1 hold only for the set of graphs that at least satisfy the condition min C P

c G(C) > 2w( P) and even though our results hold for a large set

Published in Transactions on Machine Learning Research (12/2025)

of graphs, the individual role of graph-related constants, encapsulated in f(G, Θ, α0), is not obvious. By further restricting the set of graphs, we are able to provide a simplified bound

Corollary 1. Assume w( P)(ψ+2κ)

c G(C) Ω, with some positive constant Ω< 1, then under assumptions 1 to 4,

the expected regret of the Network Lasso Bandit algorithm is upper bounded as follows:

R( T) = O 1 ϕ2(1 Ω) w( P) max C P ιG(C)

log T|V| + 4q

|V| log ( T|V|)

+ w( P)2 max C P ιG(C)

(1 Ω)2 min C P( c G(C) c2 G(C)) log(d|V|) + p

The simplified bound in Corollary 1 exhibits the typical multi-task learning dependency q

T|V| rather

than the independent task learning case |V|

T and highlights the role of graph related properties such as the total weight of the boundary, the maximal inner isoperimetric ratio and the minimal topological centrality index. Furthermore with Ωwe can see the influence on the regret bound, when w( P) changes relative to min C P

c G(C). Compared to the Graph UCB algorithm (Yang & Toni, 2018), that yields a regret

bound of O d|V| T and the trace norm bandit algorithm (Cella et al., 2023) with a regret bound of

r0d|V| T + |V| p

r0 T , with r0 < d as true rank of the multi task structure, our result shows an

improvement with respect to the number of nodes and horizon. The bound of the SCLUB algorithm given

in Li et al. (2019) yields an improved regret bound of O d q

|P||V| T that depends on the total number of

clusters |P|. While our algorithm exhibits a similar result, it does not explicitly depend on the number of clusters but with the properties of the graph and the compatibility with the underlying cluster structure. We present two additional specific instances: a graph with no boundary (i.e., w( P) = 0), and a graph where full connectivity is achieved within each individual cluster.

Corollary 2. Assume w( P) = 0, then under assumptions 1 to 4, the expected regret of the Network Lasso Bandit algorithm is upper bounded as follows:

R( T) = O 1

log T|V| + 4q

|V| log ( T|V|) + 1 min C P( c G(C) c2 G(C)) log(d|V|) + p

Corollary 3. Assume w( P)(ψ+2κ) min C P

c G(C) Ω, with some positive constant Ω< 1 and assume any two nodes

within the same cluster are connected through an edge in the given graph, then under assumptions 1 to 4, the expected regret of the Network Lasso Bandit algorithm is upper bounded as follows:

R( T) = O 1 ϕ2(1 Ω) w( P) max C P ιG(C)

log T|V| + 4q

|V| log ( T|V|)

+ w( P)2 max C P ιG(C)

(1 Ω2) min C P |C| log(d|V|) + p

6 Experiments

We compare our algorithm with α0 = 1 to several baselines of the literature. On the one hand, we consider baselines relying on a given graph, GOBLin (Cesa-Bianchi et al., 2013) and Graph UCB (Yang et al., 2020)

Published in Transactions on Machine Learning Research (12/2025)

that use the Laplacian to smooth the preference vectors. On the other hand, we compare to clustering of bandits baselines, namely CLUB (Gentile et al., 2014), SCLUB (Li et al., 2019), OLS-ITL (Bastani et al., 2021) and LOCB (Ban & He, 2021). We provided CLUB with graph G rather than a fully connected graph for a fair comparison. We also include the trace norm bandit algorithm (Cella et al., 2023), which is relevant when the number of clusters is smaller than d. Indeed, the cluster structure of Θ can be mathematically written as Θ = P

C P 1Cθ C, where 1C is the indicator vector of cluster C (coordinates equal to 1 on the nodes belonging to C and zeros elsewhere) and θC is the true vector of every node in C. The range of Θ is equal to the span of 1C; C P, implying that its rank is at most equal to min(d, |P|). It will then satisfy the low-rank assumption for |P| < d. As a sanity check, we compare to the independent task learning case with Lin UCB (Lin Ucb ITL) where each task is solved independently. The graph used is weightless and generated using a stochastic block model to ensure a cluster structure, where an edge is constructed with probability p within clusters and q between clusters.

Experimentally, we found that normalizing the weights as wmn = (deg(m) deg(n)) 1

2 , where deg(m) denotes the degree of node m, yields significantly better results. Indeed, such a normalization makes the algorithm focus more on edges between low-degree nodes, which improves the propagation of the collected information within the graph.

0 500 1000 1500 2000 2500 3000 3500 4000 time steps

cumulative regret

CLUB GOBLin Graph UCB LOCB Lin Ucb ITL Net Lasso OLS-ITL Trace-Norm

(a) |V| = 200, |P| = 25, d = 10, p = 0.5, q = 0.05

0 500 1000 1500 2000 2500 3000 time steps

cumulative regret

CLUB GOBLin Graph UCB LOCB Lin Ucb ITL Net Lasso OLS-ITL SCLUB Trace-Norm

(b) |V| = 50, |P| = 5, d = 80, p = 0.8, q = 0.2

0 200 400 600 800 1000 time steps

cumulative regret

CLUB GOBLin Graph UCB LOCB Lin Ucb ITL Net Lasso OLS-ITL SCLUB Trace-Norm

(c) |V| = 100, |P| = 8, d = 10, p = 0.5, q = 0.1

0 500 1000 1500 2000 2500 3000 time steps

cumulative regret

CLUB GOBLin Graph UCB LOCB Lin Ucb ITL Net Lasso OLS-ITL SCLUB Trace-Norm

(d) |V| = 100, |P| = 8, d = 20, p = 0.4, q = 0.1

Figure 1: Synthetic data experiments showing the cumulative regret of Network Lasso Policy as a function of time-steps compared to other baselines, for different choices of |V|, |P|, d, p and q.

Published in Transactions on Machine Learning Research (12/2025)

Our results clearly demonstrate an improvement compared to the other baselines. Our policy performs significantly better than the rest beyond the error margins, covering one standard deviation at ten repetitions. We provide results for up to |V| = 200 nodes showing the effective transfer of knowledge between nodes.

7 Conclusion and future perspectives

In this work, we proposed a multi-task bandit framework that solves the case where the task preference vectors are piecewise constant over a graph. To this end, we used the Network Lasso policy to estimate the task parameters, which bypasses explicit clustering procedures. We established a sublinear regret bound and proved a novel oracle inequality that relies on the small size of the boundary and the high value of the topological centrality index of each node within its cluster. Our experimental evaluations highlight the advantage of our method, especially when either the number of dimensions or nodes increases.

Due to the technical similarity of our problem with the Lasso, a natural extension would be to extend it to a thresholded approach, in the same vein as (Ariu et al., 2022). Another possible extension would be to use regularization with higher order total variation terms that impose a piecewise polynomial signal on a graph, as explained for scalar signals in Wang et al. (2016); Ortelli & van de Geer (2019).

Acknowledgments

This work was supported by the German Research Foundation (DFG) under Grant Nos. MA7111/6-1 and MA7111/7-1. The authors thank the International Max Planck Research School for Intelligent Systems (IMPRS-IS) for supporting Steven Bilaj.

Yasin Abbasi-Yadkori, Dávid Pál, and Csaba Szepesvári. Improved algorithms for linear stochastic bandits. In Advances in Neural Information Processing Systems, 2011.

Kaito Ariu, Kenshi Abe, and Alexandre Proutiere. Thresholded Lasso Bandit. In Proceedings of the 39th International Conference on Machine Learning, 2022.

Yikun Ban and Jingrui He. Local clustering in contextual multi-armed bandits. In Proceedings of the Web Conference 2021, 2021.

Hamsa Bastani and Mohsen Bayati. Online Decision Making with High-Dimensional Covariates. Operations Research, 2019.

Hamsa Bastani, Mohsen Bayati, and Khashayar Khosravi. Mostly exploration-free algorithms for contextual bandits. Manage. Sci., 2021.

Soumya Basu, Branislav Kveton, Manzil Zaheer, and Csaba Szepesvari. No Regrets for Learning the Prior in Bandits. In Advances in Neural Information Processing Systems, 2021.

Steven Bilaj, Sofien Dhouib, and Setareh Maghsudi. Meta learning in bandits within shared affine subspaces. In Proceedings of The 27th International Conference on Artificial Intelligence and Statistics, 2024.

Javier Borge-Holthoefer, Alejandro Rivero, Iñigo García, Elisa Cauhé, Alfredo Ferrer, Darío Ferrer, David Francos, David Iniguez, María Pilar Pérez, Gonzalo Ruiz, et al. Structural and dynamical patterns on online social networks: the spanish may 15th movement as a case study. Plo S one, 2011.

Peter Bühlmann and Sara van de Geer. Statistics for high-dimensional data. Springer Series in Statistics. Springer, Heidelberg, 2011.

Leonardo Cella and Massimiliano Pontil. Multi-task and meta-learning with sparse linear bandits. In Uncertainty in Artificial Intelligence, 2021.

Leonardo Cella, Alessandro Lazaric, and Massimiliano Pontil. Meta-learning with stochastic linear bandits. In International Conference on Machine Learning, 2020.

Published in Transactions on Machine Learning Research (12/2025)

Leonardo Cella, Karim Lounici, Grégoire Pacreau, and Massimiliano Pontil. Multi-task representation learning with stochastic linear bandits. In International Conference on Artificial Intelligence and Statistics, 2023.

Nicolo Cesa-Bianchi, Claudio Gentile, and Giovanni Zappella. A gang of bandits. Advances in neural information processing systems, 2013.

Jeff Cheeger. A lower bound for the smallest eigenvalue of the laplacian. Problems in analysis, 1970.

Xiaotong Cheng, Cheng Pan, and Setareh Maghsudi. Parallel online clustering of bandits via hedonic game. In International Conference on Machine Learning, 2023.

Wei Chu, Lihong Li, Lev Reyzin, and Robert Schapire. Contextual bandits with linear payoff functions. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, 2011.

Xiaowen Dong, Dorina Thanou, Michael Rabbat, and Pascal Frossard. Learning graphs from data: A signal representation perspective. IEEE Signal Processing Magazine, 2019.

David Easley, Jon Kleinberg, et al. Networks, crowds, and markets: Reasoning about a highly connected world. Cambridge university press Cambridge, 2010.

Angela Fontan and Claudio Altafini. On the properties of laplacian pseudoinverses. In 2021 60th IEEE Conference on Decision and Control (CDC), 2021.

Claudio Gentile, Shuai Li, and Giovanni Zappella. Online clustering of bandits. In International Conference on Machine Learning, 2014.

Claudio Gentile, Shuai Li, Purushottam Kar, Alexandros Karatzoglou, Giovanni Zappella, and Evans Etrue. On context-dependent clustering of bandits. In International Conference on machine learning, 2017.

David Hallac, Jure Leskovec, and Stephen Boyd. Network lasso: Clustering and optimization in large graphs. In Proceedings of the 21th ACM SIGKDD international conference on knowledge discovery and data mining, 2015.

Xiao He, Francesco Alesiani, and Ammar Shaker. Efficient and scalable multi-task regression on massive number of tasks. In Proceedings of the AAAI Conference on Artificial Intelligence, 2019.

Mark Herbster, Stephen Pasteris, Fabio Vitale, and Massimiliano Pontil. A gang of adversarial bandits. Advances in Neural Information Processing Systems, 2021.

Daniel Hsu, Sham Kakade, and Tong Zhang. A tail inequality for quadratic forms of subgaussian random vectors. Electronic Communications in Probability, 2012.

Jiachen Hu, Xiaoyu Chen, Chi Jin, Lihong Li, and Liwei Wang. Near-optimal representation learning for linear bandits and linear rl. In International Conference on Machine Learning, 2021.

Alexander Jung. Networked Exponential Families for Big Data Over Networks. IEEE Access, 2020.

Alexander Jung and Natalia Vesselinova. Analysis of network lasso for semi-supervised regression. In The 22nd International Conference on Artificial Intelligence and Statistics, 2019.

Alexander Jung, Nguyen Tran, and Alexandru Mara. When Is Network Lasso Accurate? Frontiers in Applied Mathematics and Statistics, 2018.

Gi-Soo Kim and Myunghee Cho Paik. Doubly-robust lasso bandit. Advances in Neural Information Processing Systems, 2019.

Branislav Kveton, Mikhail Konobeev, Manzil Zaheer, Chih-wei Hsu, Martin Mladenov, Craig Boutilier, and Csaba Szepesvari. Meta-thompson sampling. In International Conference on Machine Learning, 2021.

Published in Transactions on Machine Learning Research (12/2025)

Lihong Li, Wei Chu, John Langford, and Robert E Schapire. A contextual-bandit approach to personalized news article recommendation. In Proceedings of the 19th international conference on World wide web, 2010.

Shuai Li, Wei Chen, and Kwong-Sak Leung. Improved algorithm on online clustering of bandits. ar Xiv preprint ar Xiv:1902.09162, 2019.

Miller Mc Pherson, Lynn Smith-Lovin, and James M Cook. Birds of a feather: Homophily in social networks. Annual review of sociology, 2001.

Mark EJ Newman. Modularity and community structure in networks. Proceedings of the national academy of sciences, 2006.

Trong T Nguyen and Hady W Lauw. Dynamic clustering of contextual multi-armed bandits. In Proceedings of the 23rd ACM international conference on conference on information and knowledge management, 2014.

Behzad Nourani-Koliji, Steven Bilaj, Amir Rezaei Balef, and Setareh Maghsudi. Piecewise-stationary combinatorial semi-bandit with causally related rewards. In ECAI 2023. IOS Press, 2023.

Min-Hwan Oh, Garud Iyengar, and Assaf Zeevi. Sparsity-Agnostic Lasso Bandit. In Proceedings of the 38th International Conference on Machine Learning, 2021.

Francesco Ortelli and Sara van de Geer. Synthesis and analysis in total variation regularization. ar Xiv preprint ar Xiv:1901.06418, 2019.

Amit Peleg, Naama Pearl, and Ron Meir. Metalearning linear bandits by prior update. In Proceedings of The 25th International Conference on Artificial Intelligence and Statistics, 2022.

Gyan Ranjan and Zhi-Li Zhang. Geometry of complex networks and topological centrality. Physica A: Statistical Mechanics and its Applications, 2013.

Xiaoyuan Su and Taghi M Khoshgoftaar. A survey of collaborative filtering techniques. Advances in artificial intelligence, 2009.

Robert Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society Series B: Statistical Methodology, 1996.

Joel Tropp. Freedman s inequality for matrix martingales. Electronic Communications in Probability, 2011.

Piet Van Mieghem, Karel Devriendt, and H Cetinay. Pseudoinverse of the laplacian and best spreader node in a network. Physical Review E, 2017.

Yu-Xiang Wang, James Sharpnack, Alexander J. Smola, and Ryan J. Tibshirani. Trend filtering on graphs. Journal of Machine Learning Research, 2016.

Kaige Yang and Laura Toni. Graph-based recommendation system. In 2018 IEEE Global Conference on Signal and Information Processing (Global SIP), 2018.

Kaige Yang, Laura Toni, and Xiaowen Dong. Laplacian-regularized graph bandits: Algorithms and theoretical analysis. In International Conference on Artificial Intelligence and Statistics, 2020.

Ming Yuan and Yi Lin. Model selection and estimation in regression with grouped variables. Journal of the Royal Statistical Society Series B: Statistical Methodology, 2006.

Published in Transactions on Machine Learning Research (12/2025)

A Some helper results

Proposition 2 (Bounds on norms of matrix products). Let M Rm n and N Rn p. Then

MN q,1 M ,1 N q,1 q [1, ]

M M , N 2,1

MN 2,1 M 2,1 N

First inequality For any q [1, ], we have:

j=1 eje j N

max 1 j n e i Mej n X

e j N q = max 1 j n |(M)ij| N q,1

Second inequality We have

j=1 M Nej 2 = M N 2 F

Third inequality We have

MN 2 F = Tr(MNN M ) M M , NN 1,1

Elements of (i, j) entry of matrix NN is the inner product e i N, e j N . Hence, we have

e i N, e j N X

e i N e j N = N 2 2,1.

Fourth inequality We have

i=1 ei M N = M 2,1 N

Proposition 3 (Decomposition of a signal over a graph). For any C P

Let Z R|V| d be a graph signal. Let us denote by ZC the signal obtained from Z by setting rows of vertices outside of C to zeros, and let Z|C R|C| d be the signal obtained from ZC by removing the rows of vertices outside of C. Also, let B|C R|EC| |C| be the matrix obtained by taking BC, and removing rows of edges that link C to its outside, and the resulting null columns. It is clear that

BCZ = BCZC = B|CZ|C (10)

Published in Transactions on Machine Learning Research (12/2025)

Let QC := B CBC. Then

C P JC + QC (11)

Q Pc := B Pc B Pc = X

C P QC (12)

where JC = 1C1 C |C| , QC = B CBC C P and Q Pc := B Pc B Pc.

C P JC projects each entry of a graph signal onto the mean vector value of its respective cluster, its residual Q Pc can be interpreted as the projection onto the respective entries deviation from its cluster mean value.

Proof. Since the proof of the first point is trivial, we directly treat the second point. Denoting B |C the

pseudo-inverse of B|C it is a well-known linear algebra result that the matrix Q|C := B |CB|C is the projector onto the null space of B|C. Since C is connected, the null space of B|C is unidimensional, and is generated by vector 1|C| R|C| having only ones as coordinates. Since the projector into that null space is J|C| := 1|C|1|C|

|C| , we deduce that

Z|C = J|C|Z|C + Q|CZ|C = ZC = JCZC + QCZC = JCZ + QCZ

where in the last line, QC := B CBC. Consequently, we have

C P JCZ + QCZ

To prove the second point, we recall that B Pc is the incidence matrix obtained by setting rows corresponding to edges in P to zero. In other words, B Pc is the incidence matrix of the graph after removing the boundary edges, and having exactly |P| connected components. Hence, B Pc has a null space spanned by the set {1C}C P, and the orthogonal projector onto this null space is P

C P JC. Combining this fact with the fact that Q Pc is the projector onto the orthogonal of the null space of B Pc, we arrive at the second point.

Proposition 4 (On the minimum topological centrality index of a graph vertex). Let G be a connected graph with incidence matrix B and vertex set size N, and let L := B B. Let c(G) denote the minimum value of inverses of diagonal element of L , called its minimum topological centrality index. Also let a(G) be its algebraic connectivity, defined as the minimum non null eigenvalue of L. Then

c(G) = L 1 , .

If G is weightless, then c(G) N 2 N 1.

Proof. Since L is PSD, L is PSD and hence L , is equal to the maximum diagonal entry of L . Taking the inverse proves the first point. Also, this implies that

, L 1 = a(G), (13)

Published in Transactions on Machine Learning Research (12/2025)

where we used the fact that , for matrices. This proves the second point of the proposition.

For the last point, assume G is weightless, let Lcomp be the Laplacian of complete graph built on the vertices of G. Then we have Lcomp = N(IN JN), where J is the square matrix of dimension N having 1/N as entries. From Fontan & Altafini (2021, Lemma 4), we have

L comp = (Lcomp + NJN) 1 1

which has diagonal elements 1 N 1 N 2 .

On the other hand, L Lcomp Hence, by Fontan & Altafini (2021, lemma 4) we have for any u = 0

L = (L + a JN) 1 JN/a (Lcomp + a JN) 1 JN/a = L comp

This implies that the maximum diagonal entry of L is at least equal to that of L comp, i.e.to 1

N 1 N2 . Taking the inverse of that entry finishes the proof.

B Proofs of the different claims

B.1 Additional notation

The regularization term can be written more compactly using the incidence matrix of the graph B R|E| |V|

corresponding to an arbitrary orientation under the following form

1 m<n |V| wmn θm θn = BΘ 2,1 = Θ E (15)

where the 2,1 norm denotes the sum of the L2 norms o the rows of a matrix.1 We provide notations that we use in the proofs of the different statements, in order to reduce the clutter. We define E := ˆΘ Θ as the error signal and its rows by {ϵm}|V| m=1.

While PC k=1 JC projects each entry of a graph signal onto the mean vector value of its respective cluster, its residual Q Pc can be interpreted as the projection onto the respective entries deviation from its cluster mean value.

Let ηm be a vector, vertically concatenated by noise terms of rewards received by node m, then we define K R|V| d as the matrix of vertically concatenated row vectors η m Xm.

N.B. Except for the results concerning the regret bound, we consider the case κ 0 rather than κ 1 in our proofs.

B.2 Oracle inequality

In this section, we present all intermediary theoretical results leading to Theorem 5 stating the oracle inequality. To reduce clutter, we omit the dependence on t of several quantities. For instance, we write α and ˆΘ instead of α(t) and ˆΘ(t).

Definition 3 (Restricted Eigenvalue (RE) condition and norm, generalization of Definition 2). Let {Mi}|V| i=1 Rd d be a set of positive semi-definite matrices. We say that the matrix MV := diag(M1, , M|V|) verifies the restricted eigenvalue condition with constants κ 0 and ϕ > 0 if

ϕ2 Z 2 RE X

i V zi 2 Mi Z S with rows {zi}i V,

1It is possible that the notation 2,1 denotes the sum of 2 norms of columns in the literature.

Published in Transactions on Machine Learning Research (12/2025)

Table 1: Notation table.

Notation Meaning

Independent of time t

V set of graph vertices E set of graph edges BI R|E| |V|, I E Graph incidence Matrix obtained by setting rows of edges outside I to zeros BC R|E| |V| cf. Definition 1 L R|V| |V| B B θm Rd true preference vector of user/bandit m Θ R|V| d matrix of true vertically concatenated row preferences vectors P E Boundary of P: set of edges connecting nodes from different clusters c G(C) Minimum topological centrality index of a node in C restricted to the graph in C w( P) Total weight of P, i.e. sum of weights of edges in P Euclidean norm for vectors, largest singular value for matrices A Semi-norm defined by PSD matrix A: x 2 A := x Ax F matrix Frobenius norm p,q q-norm of the vector with coordinates equal to the p norm of rows I, I E Total variation norm of signal over edges of I A Moore-Penrose pseudo-inverse of matrix A vec vectorization operator consisting in concatenating the columns vertically Kronecker product 1C R|V| Vector with entries corresponding to vertices in C equal to 1 and 0 elsewhere

JC R|V| |V| equal to 1C1 C |C| QC R|V| |V| equal to B CBC QI R|V| |V|, I E equal to B IBI ek elementary vectors of dimension depending on the context σ Subgaussianity constant / variance proxy

Dependent on time t

Tm(t) set of time steps user m has been encountered before time t ˆθm Rd estimated preference vector of user/bandit m ϵm Rd estimation error for user/bandit m : ˆθm θm E R|V| d vertical concatenation of row vectors ϵm ηm R|Tm(t)| vector of subgaussian noise of user m x(t) Rd context vector received at time t m(t) N user at time t Xm R|Tm(t)| d data matrix of user m X Rt d data matrix of context vectors of all users Am Rd d X m Xm (potentially associated to time t) AV Rd|V| d|V| diag(A1, , Am) K R|V| d matrix of vertically concatenated row vectors η m Xm

where S is the cone defined by:

S := {Z R|V| d; a1

G, Θ, 1 ψw( P)

G, Θ, 1 ψw( P)

ZP F + (1 κ)+ Z P},

a1(G, Θ, α0) := 1

1 α0 + 2κw( P)

c G(C) , a2(G, Θ, α0) := 1

2κw( P) max C P

Published in Transactions on Machine Learning Research (12/2025)

and the RE semi-norm is defined by Z RE := ZP F (1 κ)+ B PB PZ We have the structure

dependent unknown constants ψ and κ, for which we assume they guarantee a1 G, Θ, 1 ψw( P) > 0.

Proof of Proposition 1. Let Z = 1Cv be a constant per cluster signal, with 1C R|V| as indicator vector with the ith entry equal to 1 if i C and 0 otherwise. Then Z is contained in any cone S defined in the RE condition and we have:

ZC 2 F = Z 2 F = 1Cv v1 C

= 1 C 1C v 2

For the right hand side of the RE condition we have:

i V zi 2 Mi = vec(Z ) M vec(Z )

= vec(1Cv ) M vec(1Cv )

i V eie i Mi

i V eie i Mi

i V 1 C eie i 1C v Miv

i C v Miv = v X

Plugging the results into the RE condition, we get:

= ϕ2|C| v 2 v X

= ϕ2 v P i C Mi v

Lemma 1 (A first deterministic inequality). Let t be a time step. We have

m V Xmϵm 2 + E Pc 1

tα K, E + E P (16)

Proof. By optimality of ˆΘ, we have

Xm ˆθm ym 2 + α Θ E 1

m V Xmθm ym 2 + α Θ E (17)

Published in Transactions on Machine Learning Research (12/2025)

where the second line holds by definition of the observed rewards.

On the one hand, given a user index m V, and since by definition of the observed rewards we have we have for the least squared terms

Xm ˆθm ym 2 = Xm ˆθm Xmθm ηm 2

= Xmϵm ηm 2

= Xmϵm 2 + Xmθm ym 2 η m Xmϵm

where we used the fact that ym = Xmθm +ηm, which holds by definition of the observed rewards. Summing over the users, and using the definition of K, we have

Xm ˆθm ym 2 1

m V Xmθm ym 2 = 1

m V Xmϵm 2 1

t K, E (18)

On the other hand, we have for the estimated preference vectors

(m,n) E wmn ˆθm ˆθn

(m,n) P wmn ˆθm ˆθn + X

(m,n) Pc wmn ˆθm ˆθn

= ˆΘ P + ˆΘ Pc,

For the true ones, and for any C P, let EC denote the edges linking the nodes of set of nodes C. It is clear that Pc = S

C P EC as a disjoint union, hence

(m,n) E wmn θm θn

(m,n) P wmn θm θn + X

(m,n) Pc wmn θm θn

(m,n) EC wmn θm θn

where the last equality holds due to the cluster assumption.

Hence, we have

Θ E Θ E = Θ P ˆΘ P ˆΘ Pc

E P ˆΘ Pc, (19)

where the first inequality holds due to the triangle inequality, and the last one since Θ Pc = 0. Combining Equations (17) to (19), we obtain the result of the statement.

In the proof for the oracle inequality, we utilize projection operators on the graph signal, which we define as follows:

While PC k=1 JC projects each entry of a graph signal onto the mean vector value of its respective cluster, its residual Q Pc can be interpreted as the projection onto the respective entries deviation from its cluster mean value.

Published in Transactions on Machine Learning Research (12/2025)

Lemma 2 (Bounding the error restricted to the boundary). The total variation of E restricted to the boundary verifies

ιG(C) EP F + 2 E Pc

Proof. The proof relies on a decomposition of the E P term from Proposition 3. We have

C P JCE + QCE

= EP + B Pc B Pc E P

EP P + B Pc B Pc E P (21)

where EP is obtained by setting the error signal on every cluster to its mean.

For the first term on the right-hand side, let us denote by ϵC the value of any row of EP belonging to cluster C, which is equal to the mean of errors E over that cluster. Also, we denote by (EP) P the signal obtained from EP by setting its rows corresponding to nodes that are not adjacent to any edge in the boundary P to zeros. Also, let v C denote the inner boundary of set of nodes C,i.e. nodes of C that connect it to its complementary. Then it holds that:

EP P = B PEP 2,1 = B P(EP) P 2,1 B P 2,1 (EP) P (by Proposition 2)

B P 2,1 (EP) P F

C P | v C| ϵC 2

|C| |C| ϵC 2

B P 2,1 max C P

C P |C| ϵC 2

2w( P) max C P

ιG(C) EP F (22)

Published in Transactions on Machine Learning Research (12/2025)

For the second term, we have

B Pc B Pc E P = B PB Pc B Pc E 2,1

B PB Pc ,1 E Pc

B PB Pc F E Pc

(B Pc) B P F E Pc

r B Pc(B Pc) , E Pc (by Proposition 2)

B P 1,1 min C P

c G(C) E Pc.

c G(C) E Pc. (23)

The result is obtained by combining Equations (21) to (23).

Theorem 4 (Theorem 2.1 of Hsu et al. (2012)). At time step t, let A Rb t where b N , and let v Rt

be a random vector such that for some σ 0, we have

E [exp( u, v )] exp u 2 σ2

Then for any δ (0, 1), we have with a probability at least 1 δ:

A 2 F + 2 A A F

δ + 2 A 2 log 1

Lemma 3 (Empirical process bound). Let Xm R|Tm| d denotes the matrix of collected context vectors for task m V, then, given collected context matrices {Xm}m V, for any δ (0, 1) we have with probability of at least 1 δ:

αδ(t) := α0σ

v u u tt + 2

m V |Tm(t)|2 log 1

δ + 2 max m V |Tm(t)| log 1

Proof. We recall that K Rt d is the matrix obtained by stacking the row vectors η m Xm vertically. On the one hand, we have

X mηm 2 = X V η 2, (25)

where XV := diag(X1, , X|V|) Rt d|V| .

Published in Transactions on Machine Learning Research (12/2025)

On the other hand, for any u = (u1, , ut) Rt, denoting P(t) := exp Pt τ=1 uτητ , we have

E [P(t)] = E [E [exp{utηt}P(t 1)|Ft 1]] (by the law of total expectation)

= E [P(t 1)E [exp(utηt)|Ft 1]] (because {ηs}t 1 s=1 are Ft 1 measurable.)

E [P(t 1)] (by the conditional subgaussianity assumption)

(by induction)

2σ2 u 2 . (26)

From Equations (25) and (26), we can apply Theorem 4 to matrix XV and random vector η, which implies that with a probability at least 1 δ, we have

m V Am 2 F log 1

δ + 2 max m V Am log 1

where we used the equalities XV F = P

m V Tr(Am), XV 2 = max m V Am and XVX V 2 F = X V XV 2 F = P

m V Am 2 F . To arrive the statement of the theorem, we use the fact that the context vectors have Euclidean norms of at most 1.

Proposition 5 (Probabilistic inequality). With a probability at least 1 δ, we have

m V Xmϵm 2 + a1(G, Θ, α0) E Pc a2(G, Θ, α0) EP F + (1 κ) E P, (27)

where 0 κ < min C P

Proof. The proof is a combination of the results of Lemmas 1 to 3. We have

m V Xmϵm 2 + E Pc

1 tαδ K, E + E P (by Lemma 1)

α0 E F + κ E P + (1 κ) E P (by Lemma 3)

EP F α0 + E Pc

c G(C) + κw( P)

ιG(C) EP F + 2 E Pc

+ (1 κ) E P,

where the last line is an application of Lemma 2. Grouping the terms by the type of norm applied to E finishes the proof.

Theorem 5 (Oracle inequality, generalization of Theorem 1). Assume that the RE assumption

holds for the empirical multi-task Gram matrix with constants κ 1, 1 2w( P) min C P

c G(C) , ψ 0, 1 w( P) min C P

c G(C) 2 and ϕ > 0. Suppose that maxm V |Tm(t)| bt for some b > 0 and α0 1 ψw( P).

Published in Transactions on Machine Learning Research (12/2025)

Then, with a probability at least 1 δ(t), we have

Θ ˆΘ(t) F 2 σα0

tf(G, Θ, α0)

v u u t1 + 2b

|V| log 1 δ(t) + 2b log 1 δ(t),

f(G, Θ) := a2(G, Θ) +

21 1(κ)w( P)

21 1(κ)w( P)

a1(G, Θ) min C P

Proof. Using the previously established results, we obtain

m V Xmϵm 2 + α E Pc

αδa2(Θ, G) EP F + αδ(1 κ)+ E P (by Proposition 5)

=αδa2(Θ, G) EP F + αδ(1 κ)+ B PB PB PE 2,1 (by properties of the pseudo-inverse)

αδa2(Θ, G) EP F + αδ B P 2,11 1(κ)(1 κ)+ B PB PE (by Proposition 2)

αδ(a2(Θ, G) + 1 1(κ)

2w( P)) E RE (by definition of the RE norm)

αa2(Θ, G) + 1 1(κ)

m V ϵm 2 Am (using the RE assumption)

βα2 δ(a2(Θ, G) + 1 1(κ) B P 2,1)2

2ϕ2 + 1 2βt

m V Xmϵm 2, (28)

where the last inequality holds for any β > 0, and is a consequence of the property that uv u2 + v2

2 for any u, v R. In the second to last inequality we used the RE assumption, here it is important to mention that the assumption does not hold for any choice of α0. In the definition of S i.e. the set matrices for which the RE condition holds, we have α0 = 1 ψw( P). We can also observe that this set is non increasing for the inclusion operator i.e. the RE condition would become weaker, for increasing α0. Thus for any α0 1 ψw( P) the respective set of the RE assumption is contained in S and any matrix contained in the smaller set is automatically contained in S, allowing us to use the RE condition in the proof due to our lower bound on α0 1 ψw( P).

As a result, we can bound the norm of Q Pc E as follows:

Q Pc E F = B Pc B Pc E F

r L Pc , E Pc

2αδ(a2(Θ, G, α0) + 1 1(κ) B P 2,1)2

ϕ2a1(Θ, G, α0) min C P

c G(C) (Equation (28) with β = 1). (29)

We can also bound the norm of EP as follows:

EP 2 F 1 tϕ2 X

m V Xmϵm 2 (by RE assumption on empirical multi-task Gram matrix)

4α2 δ(a2(Θ, G, α0) + 1 1(κ) B P 2,1)2

ϕ4 (by Equation (28) with β = 2). (30)

Published in Transactions on Machine Learning Research (12/2025)

The result is then obtained by combining Equations (29) and (30) along with using the fact that E = EP + Q Pc E and the expressions of a1(Θ, G, α0) and a2(Θ, G, α0), and bounding αδ(t) as follows:

m V Xm 2 F + 2

m V Xm X m 2 F log 1

δ + 2 max m V Xm 2 log 1

m V |Tm(t)|2 log 1

δ + 2 max m V |Tm(t)| log 1

δ + 2t log 1

B.3 Inheriting the RE condition from the true to the empirical data Gram matrix

B.3.1 From the adapted to the empirical multi-task Gram matrix

Lemma 4 (Bounding a quadratic form using projections). Let M1, , Mp Rd d be symmetric matrices, and let J := 1

p11 , and Q = I J. Then, for any Z Rp d with rows {zi}p i=1, we have:

i=1 z i Mizi

Z Q Z J + max 1 i p Mi Z 2 Q

Proof. We have

i=1 z i Mizi

i=1 z Mi z + 2

i=1 (zi z) Mi z +

i=1 (zi z) Mi(zi z)

i=1 e i QZMi z

i=1 e i QZMi Z Qei

where we used the fact that zi z = Z ei Z Jei = Z Qei.

Let us now examine every term on the right-hand side of Equation (31). For the first term, we have

Z 2 J. (32)

Published in Transactions on Machine Learning Research (12/2025)

For the second term, we have

i=1 e i QZMi z

i=1 Mi Z Qei

i=1 (e i Mi) vec(Z Q)

i=1 (e i Mi)

i=1 (e i Mi)

i=1 (e i Mi)) p X

i=1 (e i Mi)

j=1 (e i Mi))(ej Mj)

j=1 (e i ej Mi Mj)

QZ F z . (33)

Finally, for the last term, we have

i=1 e i QZMi Z Qei

i=1 Mi Z Qei 2

max 1 i p Mi

= max 1 i p Mi QZ 2 F . (34)

Combining Equations (32) to (34) yields the result.

We also define an operator norm that is induced by the RE introduced in Definition 3.

Definition 4 ((RE,S)-induced operator norm). Let {Mm}m V Rd d be symmetric matrices associated to the graph nodes V, and let MV := diag M1, , M|V| Rd|V| d|V|. For any cluster C P, let the cluster mean and mean of squares associated to those matrices be given by

m C Mm, M 2C := 1

The RE-induced operator norm of MV is defined as

M RE,S := max C P M C r

min C P c G(C) 1 max C P

M 2C min C P c G(C) 1 max m V Mm . (35)

Published in Transactions on Machine Learning Research (12/2025)

B.3.2 Linking the adapted to the empirical Gram

We first start by establishing that given the closeness of two PSD matrices in a certain sense, the RE condition can be transferred between them. For the sake of readability we remove the arguments of the constants: a1 = a1 G, Θ, 1 ψw( P) , a2 = a2 G, Θ, 1 ψw( P) ,

Proposition 6 (Restricted spectral norm). Let Z R|V| d verifying

a1 Z Pc a2 ZP F + (1 κ)+ Z P

Let {Mm}m V Rd d be symmetric matrices associated to the graph nodes V, and let MV := diag(M1, , M|V|) Rd|V| d|V|. Then we have:

m V z m Mmzm

1 + a2 + (1 κ)+ B P 2,1

Z 2 RE. (36)

Proof. For any cluster C, we denote by BC the incidence matrix obtained by setting the rows of B outside the edges linking nodes in C to null vectors. The latter s null space is the span of the vector 1C having coordinates 1 at nodes in C and zeros elsewhere. Hence, the projector onto the orthogonal of 1C is QC := B CBC.

On the one hand, for any signal Z R|V| d we have

C P BCZ 2,1

B CBCZ F r L C ,

Hence, by the proposition s assumptions, Z verifies

C P Z QC (a2 ZP F + (1 κ) Z P)

a2 ZP F + (1 κ)+ B P 2,1 B PB PZ

(a2 + (1 κ)+ B 2,1) Z RE

From Lemma 4, we have

m V z m Mmzm

m C z m Mmzm

M C Z 2 JC + 2 X

r M 2C Z QC Z JC + X

C P max m C Mm Z 2 QC, (37)

where we used Equation (10).

Published in Transactions on Machine Learning Research (12/2025)

This allows us to bound every term in Equation (37). For the second term on the right-hand side, we have

r M 2C Z QC Z JC

r M 2C ZP F

min C P c G(C) 1

r M 2C (a2 + (1 κ)+ B 2,1) Z 2 RE (38)

As for the third term, we have

C P max m C Mm Z 2 QC max m V Mm

max m V Mm min C P c G(C) 1

a2 1 (a2 + (1 κ)+ B 2,1)2 Z 2 RE (39)

Consequently, denoting v = a2 + (1 κ)+ B 2,1

a1 , and combining Equations (37) to (39), we obtain

m V z m Mmzm

max C P M C + 2v max C P

r M 2C + v2 max i V Mi

max C P M C ) r

min C P c G(C) 1 max C P

M 2C min C P c G(C) 1 max i V Mi (1 + v)2 Z 2 RE,

which finishes the proof.

Proposition 7 (Inheritance of a RE condition from a close matrix). Assume that the matrix VV

verifies the RE condition with constant ϕ > 0, and that

op,RE γϕ2 for some γ

0, 1 + a2+(1 κ)+

2 . Then AV

t verifies the RE condition with constant

1 γ 1 + a2 + (1 κ)+

Proof. From Proposition 5, we know that

Published in Transactions on Machine Learning Research (12/2025)

t ϵ V AVϵV = 1

|V|ϵ V VVϵV + ϵ V VϵV

|V|ϵ V VVϵV ϵ V VϵV

ϕ2 max m V V op,RE

1 + a2 + (1 κ)+ B P 2,1

1 + a2 + (1 κ)+ B P 2,1

where the third inequality is an applicaiton of Proposition 6.

Theorem 6 (Matrix Freedman Inequality, Tropp (2011)). Consider a matrix martingale {M(t)}t 1 with dimension d1 d2. Let {N(t)}t 1 be the associated difference sequence. Assume that for some A > 0, we have N(t) A t 1 almost surely. Define for any t 1:

τ=1 E N(τ)N(τ) |Fτ 1

τ=1 E N(τ) N(τ)|Fτ 1 .

Then, for any u, v > 0,

P [ t 1; M(t) u and Wcol (t) Wrow(t) v] (d1 + d2) exp 3u2

Corollary 4. Let {N(τ)}t τ=1 by a sequence of matrices of dimension d1 d2, adapted to filtration {Fτ}t τ=1. Let {ti}N i=1 an increasing sequence with elements in [t] for some N t. Consider the sequence {M(n)}N τ=1 of random matrices defined by

i=1 N(ti) E [N(ti)|Fti 1] (41)

Then {M(n)}N n=1 is a martingale adapted to the filtration {Ftn}N n=1.

Moreover,if N(τ) b τ [t] for some b > 0, then we have

P [ M(N) u] (d1 + d2) exp 3u2

Proof. We denote E [ |Fs] as Es [ ] for any s N. Also, let C(s) := Es 1 [N(s)], which is Fs 1-measurable by construction. We have for any n [N],

Etn 1 [C(tn)] = Etn 1 [Etn 1 [N(tn)]] = Etn 1 [N(tn)] (43)

= Etn 1 [N(tn) C(tn)] = 0 (44)

where the first equality is due to the tower rule since Ftn 1 Ftn 1. Also, we have for any τ 1

N(τ) C(τ) 2 = (N(τ) C(τ))2 (45)

Tr (N(τ) C(τ))2 (46)

= Tr (N(τ) C(τ))2 (47)

= N(τ) 2 F 2 Tr(C(τ)N(τ)) + Tr C(τ)2 (48)

N(τ) 2 F + Tr C(τ)2 2b2 (49)

Published in Transactions on Machine Learning Research (12/2025)

Hence N(τ) C(τ) is integrable for any τ 1. This shows that M(n) is a sequence of partial sums of matrix martingale differences, hence it is a matrix martingale.

The second part of the corollary statement is a consequence of Theorem 6. The boundedness of the sequence of martingale differences has already been established above. To verify the second requirement of the theorem, let us compute bounds on the norms of Wcol and Wrow from Theorem 6. Notice that the two matrices are equal since the difference sequence matrices N(ts) are symmetric. Hence, for any n [N], we have

Wcol(N) Wrow(N) Tr(Wcol(N)) Tr(Wrow(N)) (50)

n=1 Etn 1 (N(tn) C(tn))2 !

n=1 Etn 1 h N(tn) 2 F i Etn 1 [2 Tr(C(tn)N(tn))] + Tr C(tn)2 (52)

n=1 Etn 1 h N(tn) 2 F i Tr C(tn)2 (53)

n=1 Etn 1 h N(tn) 2 F i Nb2. (54)

By Theorem 6, we have for any u > 0

P n 1; M(n) u and Wcol(n) Nb2 (55)

P M(N) u and Wcol(N) Nb2 (56)

= P [ M(N) u] (57)

where the last line holds because we showed that the inequality Wcol(N) Nb2 holds almost surely.

Proposition 8 (Concentration of the empirical multi-task Gram matrix around the adapted one). Let t 1, b > 0. Then we have:

op,RE > γ max m V |Tm(t)| bt

d(2|P|e A1t + (|V| + |P|)e A2t + 2|V|e A3t),

A1 := 3γ2 min C P |C|t

A2 := 3γ2 min C P c G(C)t

min C P c G(C)

min C P |C|

A3 := 3γ2 min C P c G(C)2t

2γ min C P c G(C)

Proof. For γ > 0, let us define

t VV and GGram,γ := 1

t V RE,S γ ,

Published in Transactions on Machine Learning Research (12/2025)

where V is block diagonal matrix formed by { m}m V. We also define C and 2C in the same pattern of Definition 4. We can express the complementary of this event as the disjunction of a finite number of events as follows:

Gc Gram,γ (58)

= max C P C r

min C P c G(C) 1 max C P

2C min C P c G(C) 1 max m V m > tγ (59)

2C > t2γ2 min C P c G(C) [

m > tγ min C P c G(C) (60)

The first and third event can be bounded by considering the sequence xx (τ) adapted to the filtration {Fτ}, verifying xx (τ) .

Bounding the probability of the first event Let C P be a cluster. By definition, we have

|C| C(t) = X

τ Tm(t) xx(τ) E [xx(τ)|Fτ 1]

m C Tm(t) xx(τ) E [xx(τ)|Fτ 1]

We will apply Corollary 4 for the sequence of time indices in C, i.e. S

m V Tm(t). Hence |C| C is a martingale sequence, and we have

P C(t) > γt max m V |Tm(t)| bt 2d exp

m C |Tm(t)| + 2

= 2d exp 3γ2|C|t

3γ2 min C P |C|t

Bounding the probability of the third event Let m V be a task index. We apply Corollary 4 for the sequence of time steps in Tm(t). We have

τ Tm(t) xx(τ) E [xx(τ)|Fτ 1]

is a martingale sequence, hence

P m(t) > γ min C P c G(C)t max m V |Tm(t)| bt 2d exp

3γ2 min C P c G(C)2t2

6|Tm(t)| + 2

2γ min C P c G(C)t

3γ2 min C P c G(C)2t2

2γ min C P c G(C)t

3γ2 min C P c G(C)2t

2γ min C P c G(C)

Published in Transactions on Machine Learning Research (12/2025)

Bounding the probability of the second event Let C P be a cluster, and let us denote em the mth

canonical vector of R|C|. We have

τ Tm(t) xx(τ) E [xx(τ)|Fτ 1]

τ Tm(t) xx(τ) E [xx(τ)|Fτ 1]

m C Tm(t) e m(τ) (xx(τ) E [xx(τ)|Fτ 1])

m C Tm(t) e m(τ) xx(τ) E em(τ) xx(τ)|Fτ 1

where the last equality holds since m(τ) is measurable w.r.t. Fτ 1. We will apply the Corollary 4 to the set of time steps S

m C Tm(t) and the adapted sequence e m(τ) xx(τ) of matrices in Rd d|C|. Hence we have

"r 2C(t) > γt min C P

c G(C) max m V |Tm(t)| bt

d(1 + |C|) exp

3γ2|C| min C P c G(C)t2

m C |Tm(t)| + 2

|C| min C P c G(C)t

d(1 + |C|) exp

3γ2|C| min C P c G(C)t

|C| min C P c G(C)

= d(1 + |C|) exp

3γ2 min C P c G(C)t

min C P c G(C)

d(1 + |C|) exp

3γ2 min C P c G(C)t

min C P c G(C)

min C P |C|

Union bound We conclude the result of the statement via a union bound using Equation (60).

Proposition 9 (Concentration of the empirical multi-task Gram matrix around the adapted one, simplified). Let t 1, b > 0. Assume that maxm V |Tm(t)| bt. Then we have:

6d|V| exp 3γ2(min C P( c G(C) c G(C)2)t

where c G(C) := c G(C) |C| C P.

Published in Transactions on Machine Learning Research (12/2025)

Proof. The proof will rely on simple calculus inequalities. Hence, let u = min C P c G(C), v = min C P |C|, f = 3γ2, g = 6b, h = 2

2γ, which are all positive. Then, we have

A1 = fu f + g (u v)f

f + g (u v) (1 u v)f f + g(1 u v)

A2 = fv f + g v

f + g (u v) (1 u v)f f + (1 u v)g

f + gv (v u)2

f + (v u)g (u v) (1 u v)f f + (1 u v)g

where we used the fact that functions of the form x 7 x β1x+β2 for positive β1, β2 are increasing on R+.

As a final step, we use the inequality (1 x)f f + (1 x)g x 1

f + g taken for x = u v, we apply the exp( t)

function and we use the result of Proposition 8, we deduce the result.

B.3.3 From the true to the adapted Gram matrix

For all of the proofs in this subsection, we follow an approach similar to that of Oh et al. (2021). In particular, we use their Lemma 10.

Theorem 7 (Lemma 10 of Oh et al. (2021)). Under Assumption 2 on the context generating distribution, let t 1. We have for any θ Rd:

x arg max x A(t) θ, x

1 2νω Σ (64)

Proposition 10 (RE condition from the true to the adapted Gram matrix). Under Assumption 2, for any

t 1, the adapted Gram matrix VV(t) verifies the compatibility condition with constants κ and ϕ

Proof. For t 1, we have

E x(t)x(t) |Ft 1 = E

x A(t) x(t)x(t) |Ft 1

Let m V. We have

τ Tm(t) E x(τ)x(τ) |Fτ 1

τ Tm(t) E E x(τ)x(τ) |θm(τ 1), Fτ 1 |Fτ 1 (law of total expectation)

τ Tm(t) E x(τ)x(τ) |θm(τ 1) (x(τ) is fully determined by θm(τ 1))

x A(τ) xx 1

x arg max x A(t) θ, x

1 2νω Σ (by Theorem 7). (66)

Published in Transactions on Machine Learning Research (12/2025)

Now, let Z S, where S is defined with constant κ of Assumption 4. Then

m V z Vm(t) 1 2νω

m V zm Σ by Equation (66)

2νω Z 2 RE (by Assumption 4),

which finishes the proof.

Theorem 8 (RE condition holding for the empirical multi-task Gram matrix, generalization of Theorem 2). Under assumptions 2 and 4, let t 1, and let κ, ϕ be the constants from Assumption 4. Assume that

maxm V |Tm(t)| bt. Then, for any γ 0, 1 + a2+(1 κ)+

2 , the empirical multi-task Gram

matrix verifies the RE condition with constants κ and ˆϕ, with

1 γ 1 + a2 + (1 κ)+

with a probability at least equal to 1 6d|V| exp

3γ2 ϕ4(min C P( c G(C) c G(C)2)t

, where ϕ := ϕ

c G(C) := c G(C) |C| C P.

Proof. For the sake of readability, let ϕ = ϕ

2νω the compatibility constant of the adapted Gram matrix, according to Proposition 10. Then:

1 6d|V| exp 3γ2 ϕ4(min C P( c G(C) c G(C)2)t

(by Proposition 9) (69)

t satisfies the RE condition with constant κ and ˆϕ

(by Proposition 7), (70)

where ˆϕ = ϕ

1 γ 1 + a2+(1 κ)+

B.4 Regret bound

Lemma 5 (Concentration of the fraction of observations per task). Assume that |V| 2. Then for δ (0, 1), we have with a probability at least 1 δ:

max m V |Tm(t)|

1 t|V| log |V|

Proof. We have |Tm(t)| := Pt τ=1[m(τ) = m], where t, m V, P [m(t) = m] = 1 |V|, meaning that the binary variable [m(t) = m] follows a Bernoulli distribution B( 1

V ). Then, the random variable Xt := [m(t) = m] 1

|V| has mean 0, variance 1 |V|(1 1 |V|), and verifies |Xt| 1 1 |V| since |V| 2. As a result, via the Bernstein inequality, we have for any m V, and for any w 0,

|V| + w exp

2(1 1 |V|)( 1

Published in Transactions on Machine Learning Research (12/2025)

For the right-hand side to hold with a probability at most δ (0, 1), it is sufficient to have

1 |V| log 1

δ t + 4 log 1

Hence, and via a union bound, we get

1 |V| log 1

max m V |Tm(t)|

1 |V| log 1

δ t + 4 log 1

The result is obtained by adjusting the value of δ.

Theorem 9 (Regret bound, generalization of Theorem 3). Let the mean horizon per node be T = T |V|. Under assumptions1 to 4 and κ > 0, the expected regret of the Network Lasso Bandit algorithm is upper bounded as follows:

α0f(G, Θ, α0)

A log(d|V|) + p

with A = 3γ2 min C P( c G(C) c2 G(C))

Proof. For any time step t, we will define a list of good events under which the Oracle inequality and the RE condition for the empirical multi-task Gram matrix both hold with high probability. Then, we will use those bounds to sum up over time steps until horizon T.

Good events We formalize these requirements as three families of time-depending "good" events.

Gpro(t) is the event that the mean of the empirical process bounded by α(t) up to a constant c, which is equivalent to saying that it converges:

Gpro(t) := 1

Gsel(t) is the event that the number of selections of all tasks is bounded by its expected value up to a small constant ρ(t)

Gsel(t) := max m V |Tm(t)|

GRE(t) is the event that the empirical multi-task Gram matrix 1

t AV(t) satisfies the RE condition.

GRE(t) := 1

t AV(t) verifies the RE condition with constants κ, ˆϕ (74)

Event Gpro(t) is the most straightforward to cover since our bound on the empirical process given in Lemma 3 holds with a probability of at least 1 δ(t), thus:

Published in Transactions on Machine Learning Research (12/2025)

P [Gpro(t)c|Gsel(t)] δ(t), (75)

where we included the time dependency on δ(t) in contrast to the previous section. This way we emphasize to adjust δ(t) after each round, to guarantee a sub linear regret bound. The probability of event Gsel(t) can be determined using Bernstein s inequality:

From Lemma 5 we can select ρ(t) = 2 q

t |V| log |V| δsel(t) + 4

3 log |V| δsel(t) as well as P [Gsel(t)c] δsel(t).

B.4.1 Instantaneous regret decomposition

Now, given the event probabilities, we condition the instantaneous regret r(t) on the good events at a time t > t0. We have for its expectation:

E [r(t)] E [r(t)|Gsel(t)] + 2P [Gsel(t)c]

E [r(t)|Gpro(t) GRE(t) Gsel(t)]

+ 2 (P [Gpro(t)c|Gsel(t)] + P [GRE(t)c|Gsel(t)] + P [Gsel(t)c]) , (76)

where we used the worst case bound r(t) 2 if any one of the good events does not hold.

Bounding the regret Inserting our results of the event probabilities, the oracle inequality and the decomposition of the expected instantaneous regret in Equation (76) and bounding the sum over rounds, yields the final result. Thus, we start by bounding the sum over the first term i.e. the expected regret in case all good events hold:

t=1 E [r(t)|Gpro(t) GRE(t) Gsel(t)]

Taking the result of our oracle inequality in Theorem 5, we point out that only α(t) is time dependent such that the rest of the terms can be pulled outside the sum:

t f(G, Θ, α0)

v u u t1 + 2b

|V| log 1 δ(t) + 2b log 1 δ(t)

ˆϕ2 f(G, Θ, α0)

2|V| log(t) + 4b

ˆϕ2 f(G, Θ, α0) Z T

2|V| log(T) + 2 log(T) dt

ˆϕ2 f(G, Θ, α0) 2

8T |V| + 4 4 s

32 log(|V|T)T

3 log(|V|T) log(T)

2|V| log(T) + p

α0f(G, Θ, α0)

Published in Transactions on Machine Learning Research (12/2025)

f(G, Θ, α0) := a2(G, Θ, α0) +

21 1(κ)w( P)

a2(G, Θ, α0) +

21 1(κ)w( P)

a1(G, Θ, α0) min C P

We upper bounded the sum with an integral i.e. PT t=1 g(t) R T 0 g(t)dt for monotonically decreasing functions g(t) in the last inequality. Also b is the bound on the concentration of the fraction of observation per task provided by Lemma 5. For t0 = p

|V| we find by inserting the result to Lemma 5 for all t > t0:

1 t|V| log |V|

v u u t2 log |V| p

|V||V| + 8 log |V| p

|V| log(|V|) + 2 log(|V|)

Finally we bound the sum over the instantaneous regret term for the bad events:

t=1 2 (P [Gpro(t)c|Gsel(t)] + P [GRE(t)c|Gsel(t)] + P [Gsel(t)c])

By construction, we have max(P [Gpro(t)c|Gsel(t)] , P [Gsel(t)c]) δ(t) = 1

t2 . Hence,

t=1 P [Gpro(t)c|Gsel(t)] + P [Gsel(t)c] 2

As for the RE condition event, letting A := 3γ2 min C P( c G(C) c2 G(C))

2γ , we have for any t0 1

t=t0 P [GRE(t)c|Gsel(t)] 6d|V|

t=t0 exp( At) (by Theorem 8)

6d|V| e At0

1 e A 6d|V|e At0 1 + 1

6d|V|e At0 1 + 1

where in the last line, we used the inequality exp(A) A + 1. Hence, for any u > 0, choosing

A log 6d|V|(1 + 1

implies that PT t=t0 P [GRE(t)c|Gsel(t)] u. Now, we simply have to insert all our results into the sum of instantaneous regrets:

Published in Transactions on Machine Learning Research (12/2025)

R(T) t0 + 2u + 8 + O

α0f(G, Θ, α0)

A log 6d|V|(1 + 1

α0f(G, Θ, α0)

A log(12d|V|(1 + A)) + 1

α0f(G, Θ, α0)

A log(12d|V|(1 + A)) + 1

α0f(G, Θ, α0)

A log(d|V|) + α0f(G, Θ, α0)

A log(d|V|) + α0νωf(G, Θ, α0)

where we set u = 1 2A in the third inequality.

Proof of Corollary 1. Assuming w( P)(ψ+2κ) min C P

c G(C) Ω, with some positive constant Ω< 1 and setting α0 =

1 ψw( P) then the term f(G, Θ, α0) can be bounded as:

f G, Θ, α0 = 1 ψw( P)

a2 a1 min C P

= w( P) ψ +

c G(C) w( P)(ψ + 2κ) + 1

w( P)2 max C P ιG(C)

(1 Ω) min C P

Ωacts as a threshold for the quality of any graph the satisfy this bound. Similarly we can find a bound on the term 1

Published in Transactions on Machine Learning Research (12/2025)

min C P( c G(C) c2 G(C))

1 + w( P) ψ +

1 + w( P)(ψ +

2κ) max C P

min C P( c G(C) c2 G(C))

1 + w( P) ψ +

min C P( c G(C) c2 G(C))

w( P)2 max C P ιG(C)

(1 Ω)2 min C P( c G(C) c2 G(C))

Inserting the terms into the regret bound yields the final result.

Proof of Corollary 2. With α0 = 1 ψw( P) we have for α0f (G, Θ, α0) first:

α0f (G, Θ, α0) = 1 ψw( P)

c G(C) w( P)(ψ + 2κ) + 1

c G(C) w( P)(ψ + 2κ) + 1

Where we used max C P

ιG(C) = 0 in the last equality for the zero boundary case. For 1

Published in Transactions on Machine Learning Research (12/2025)

min C P( c G(C) c2 G(C))

= 8 log(|V|)

min C P( c G(C) c2 G(C)).

Inserting the terms into the regret bound yields the final result.

Proof of Corollary 3. In the case of fully connected clusters, we know from Proposition 4 that the topologival centrality index of any cluster is given by:

c G(C) = |C|2

|C| 1 > |C|

This allows us to lower bound the following term:

min C P( c G(C) c2 G(C)) > |C|

Inserting both lower bounds into our result of Corollary 1 yields the final result.

C Additional experimental details

C.1 About experiments of the main paper

The experiments have been conducted with an intel i7 CPU with 12 2.6 GHz cores and 32 GB of RAM. The two experiments with the highest number of tasks (200) and dimension (80) take about 8 hours, parallelized over the 12 cores.

To generate clusters, we generate |P| variables vii P from the uniform distribution, then we use them to construct a categorical distribution with probabilities proportional to evi. These probabilities defines the cluster proportions.

C.2 Solving the Network Lasso problem

We implement the Primal-Dual algorithm proposed in Jung (2020) to solve the Network Lasso problem but we do not vectorize the matrices (in the sense of stacking their columns into a vector), which speeds up computation.

C.3 Algebraic connectivity vs topological centrality index

Given two fully connected graphs weightless G1 and G2 with size 100 each, we progressively link them by edges and construct the Laplacian L of the resulting graph G. We measure the minimum topological centrality index min1 i 200(L C) 1 ii , and the algebraic connectivity, i.e. the minimum non-null eigenvalue of L.

Published in Transactions on Machine Learning Research (12/2025)

0.0 0.2 0.4 0.6 0.8 1.0 fraction of all edges linking the two complete graphs

algebraic connectivity minimum topological centrality index

Figure 2: Minimum Topological centrality index vs Algebraic Connectivity, for a graph formed by connecting two fully connected initial graphs G1, G2 with size 100 each.

C.4 Fusion rate of clusters

We can plot the fusion rate of clusters, i.e. the fraction of edges linking nodes with identical feature vector estimations and the respective regret as a function of time steps:

0 1000 2000 3000 4000 5000 6000 time steps

fusion rate

Fusion rate True fusion rate

(a) Fusion rate

0 1000 2000 3000 4000 5000 6000 time steps

cumulative regret

Network Lasso

Figure 3: Fusion rate and regret for the case of non-zero noise within clusters. number of users : 100, dimension: 10, number of clusters: 8, imbalance: 1.0, p: 0.5, q: 0.1,

Published in Transactions on Machine Learning Research (12/2025)

0 500 1000 1500 2000 2500 3000 time steps

fusion rate

Fusion rate True fusion rate

(a) Fusion rate

0 500 1000 1500 2000 2500 3000 time steps

cumulative regret

Network Lasso

Figure 4: Fusion rate and regret for the case of non-zero noise within clusters. number of users : 100, dimension: 10, number of clusters: 8, imbalance: 1.0, p: 0.9, q: 0.05,