# Transfer Q-Learning with Composite MDP Structures

Jinhang Chai 1, Elynn Chen 2, Lin Yang 3

1 Department of Operations Research and Financial Engineering, Princeton University. 2 Department of Technology, Operations, and Statistics, New York University. 3 Department of Electrical and Computer Engineering, UCLA. Correspondence to: Elynn Chen, Lin Yang.

To bridge the gap between empirical success and theoretical understanding in transfer reinforcement learning (RL), we study a principled approach with provable performance guarantees. We introduce a novel composite MDP framework in which high-dimensional transition dynamics are modeled as the sum of a low-rank component representing shared structure and a sparse component capturing task-specific variations. This relaxes the common assumption of purely low-rank transition models, allowing for more realistic scenarios where tasks share core dynamics but maintain individual variations. We introduce UCB-TQL (Upper Confidence Bound Transfer Q-Learning), designed for transfer RL scenarios where multiple tasks share core linear MDP dynamics but diverge along sparse dimensions. When applying UCB-TQL to a target task after training on a source task with sufficient trajectories, we achieve a regret bound of $\tilde{O}(\sqrt{e H^5 N})$ that scales independently of the ambient dimension. Here, $N$ is the number of trajectories in the target task, and $e$ quantifies the sparse difference between tasks. This result demonstrates a substantial improvement over single-task RL, obtained by effectively leveraging structural similarities between tasks. Our theoretical analysis provides rigorous guarantees for how UCB-TQL simultaneously exploits shared dynamics while adapting to task-specific variations.

1. Introduction

Transfer reinforcement learning (RL) has emerged as a promising solution to the fundamental challenge of sample
Proceedings of the 42nd International Conference on Machine Learning, Vancouver, Canada. PMLR 267, 2025. Copyright 2025 by the author(s).

inefficiency in RL. By leveraging knowledge from related tasks, transfer learning aims to accelerate policy learning and improve performance in new environments without requiring extensive data collection. This approach has shown empirical success across various domains, from robotics to game playing, yet the theoretical understanding of how transfer provably benefits RL remains limited.

Consider autonomous vehicle training as an illustrative example: core driving dynamics, including vehicle physics, road rules, and basic navigation, remain consistent across different driving scenarios. However, specific environments (urban vs. highway driving, varying weather conditions, different traffic patterns) introduce distinct variations to these core dynamics. This naturally suggests modeling transition dynamics as a combination of shared low-rank structure capturing common elements, plus sparse components representing scenario-specific variations.

We propose a composite MDP framework that formalizes this intuition: transition dynamics are modeled as the sum of a low-rank component representing shared structure and a sparse component capturing task-specific deviations. This structure appears in many real-world applications beyond autonomous driving: robotic manipulation with different objects, game playing across varying environments, and resource management under changing constraints all exhibit similar patterns of core shared dynamics with sparse task-specific variations.

Our approach extends existing work in several important directions. Prior transfer and multi-task RL research has primarily focused on pure low-rank MDPs (Agarwal et al., 2023; Lu et al., 2021; Cheng et al., 2022) or made direct assumptions about value or reward function similarity (Calandriello et al., 2014; Du et al., 2024; Chen et al., 2024; Chai et al., 2025).
While sparsity has been studied in the context of value function coefficients, theoretical analysis of sparse transition structures, particularly in combination with low-rank components, remains unexplored. This gap is significant because transition dynamics often capture task similarity more directly than value functions.

We begin by addressing single-task learning within this composite structure, introducing a variant of UCB-Q-learning tailored specifically for composite MDPs, which may involve a high-dimensional ambient space. In contrast to previous work, we consider the high-dimensional setting where the feature dimensions $p, q$ exceed the number of trajectories $N$, and the transition core $M^*$ is no longer a low-rank matrix. This departure from low-rank structures makes existing algorithms designed for linear MDPs inapplicable. Similarly, methods built for low-rank MDPs fail in our context due to the absence of low-rank assumptions on $M^*$. Our work provides the first theoretical guarantees for this setting, demonstrating how the algorithm successfully learns both shared and task-specific components. These results extend and complement the existing body of work on low-rank MDPs by explicitly handling structured deviations from low-rank assumptions (Du et al., 2019b; Lattimore et al., 2020). Unlike the approach in (Foster et al., 2021), which introduced a Decision-Estimation Coefficient (DEC) to characterize the statistical complexity of decision-making across various scenarios, our framework relies on distinct structural assumptions. This necessitates the development of new techniques, as discussed in detail in Section 3.3. Building on this foundation, we propose UCB-TQL (Upper Confidence Bound Transfer Q-Learning) for transfer learning in composite MDPs. UCB-TQL strategically exploits shared dynamics while efficiently adapting to task-specific variations.
Our theoretical analysis demonstrates that UCB-TQL achieves dimension-independent regret bounds that explicitly capture dependencies on both rank and sparsity, showing how structural similarities enable efficient knowledge transfer. In particular, we construct a novel confidence region (CR) for the sparse difference, thereby reducing the target sample complexity in the online learning process, as discussed in detail in Section 4.3.

Our primary contributions are as follows.

- A novel composite MDP model that combines low-rank shared structure with sparse task-specific components, while allowing high-dimensional feature spaces. This framework better captures real-world task relationships and provides a foundation for future work in multi-task and meta-learning settings.
- The first theoretical guarantees for single-task RL under the high-dimensional composite transition structure, demonstrating how algorithms can effectively learn and utilize both shared and task-specific components.
- A transfer Q-learning algorithm with provable regret bounds that explicitly characterize how structural similarities enable efficient knowledge transfer across tasks.

This work represents a significant step toward bridging the gap between the empirical success of transfer RL and its theoretical understanding by providing a rigorous analysis of how structural similarities in transition dynamics enable efficient knowledge transfer. Our results suggest new directions for developing practical algorithms that systematically leverage shared structure while accounting for task-specific variations.

1.1. Related Work

Transfer RL. (Agarwal et al., 2023) studied transfer via shared representations between source and target tasks. With generative access to source tasks, they showed that learned representations enable fast convergence to near-optimal policies in target tasks, matching performance as if ground-truth features were known.
(Cheng et al., 2022) proposed REFUEL for multitask representation learning in low-rank MDPs. They proved that learning shared representations across multiple tasks is more sample-efficient than individual task learning, provided enough tasks are available. Their analysis covers both online and offline downstream learning with shared representations. (Chen et al., 2022; 2024; Chai et al., 2025) analyzed transfer Q-learning without transition model assumptions, focusing instead on reward function similarity and transition density. These works established convergence guarantees for both backward and iterative Q-learning approaches.

Our work differs by studying transition models with low-rank-plus-sparse structures. This setting presents unique challenges beyond purely low-rank models, as we must identify and leverage an unknown low-rank space while also accounting for sparse deviations.

Single-task RL under structured MDPs. Single-task RL under structured MDPs has evolved through several key advances. Linear MDPs with known representations were initially studied by (Yang & Wang, 2020), leading to provably efficient online algorithms (Sun et al., 2019; Jin et al., 2020; Zanette et al., 2020; Neu & Pike-Burke, 2020; Cai et al., 2020; Wang et al., 2021). Low-rank MDPs extend this by requiring representation learning. Major developments include FLAMBE (Agarwal et al., 2020) for explore-then-commit transition estimation, and REP-UCB (Uehara et al., 2022) for balancing representation learning with exploration. Recent work has expanded to nonstationary settings (Cheng et al., 2023) and model-free approaches like MOFFLE (Modi et al., 2024). Related structured models include block MDPs (Du et al., 2019a; Misra et al., 2020; Zhang et al., 2022), low Bellman rank (Jiang et al., 2017), low witness rank (Sun et al., 2019), bilinear classes (Du et al., 2021), and low Bellman eluder dimension (Jin et al., 2021).
Our work introduces composite MDPs with a high-dimensional feature space and a low-rank-plus-sparse transition, extending beyond pure low-rank models. We provide the first theoretical guarantees for UCB Q-learning under this composite structure.

Multitask RL and Meta RL. Research in multitask and meta-RL has evolved through several key theoretical advances. Early work by (Calandriello et al., 2014) examined multitask RL with linear Q-functions sharing sparse support, establishing sample complexity bounds that scale with the sparsity rather than the ambient dimension. (Hu et al., 2021) extended this framework by studying weight vectors spanning low-dimensional spaces, showing that sample efficiency improves when the rank is much smaller than both the ambient dimension and the number of tasks. (Arora et al., 2020) demonstrated how representation learning reduces sample complexity in imitation learning settings, providing theoretical guarantees for learning shared structure across tasks. (Lu et al., 2022) further developed this direction by analyzing multitask RL with low Bellman error and unknown representations, establishing bounds that improve with task similarity.

Task-distribution approaches offered another perspective. (Brunskill & Li, 2013) proved sample complexity benefits when tasks are independently sampled from a finite MDP set, while (Pacchiano et al., 2022) and (Müller & Pacchiano, 2022) extended these results to meta-RL for linear mixture MDPs, showing how learned structure transfers to new tasks. In parallel, research on shared representations by (D'Eramo et al., 2020) established faster convergence rates for value iteration under common structure, and (Lu et al., 2021) proved substantial sample-efficiency gains in the low-rank MDP setting.
(Zhang et al., 2025) propose a cross-market multi-task dynamic pricing framework that achieves minimax-optimal regret bounds for both linear and nonparametric utilities under structured preference shifts.

Our composite MDP structure advances this line of work by explicitly modeling deviations from low-rank similarity through a sparse component. This framework captures more realistic scenarios where tasks share core structure but maintain individual variations, opening new theoretical directions for multitask and meta-learning approaches.

2. Problem Formulation

Episodic MDPs. We consider an episodic Markov decision process (MDP) with finite horizon, defined by a tuple $M = (S, A, P, r, \mu, H)$, where $S$ denotes the state space, $A$ the action space, $H$ the finite time horizon, $r : S \times A \to [0, 1]$ the reward function, $P$ the state transition probability, and $\mu$ the initial state distribution. Let $[H]$ denote the index set $\{1, 2, \ldots, H\}$. A policy $\pi : S \times [H] \to A$ maps each state-stage pair to an action that the agent takes in the episode. For each stage $h \in [H]$, the value function $V^\pi_h : S \to \mathbb{R}$ evaluates the expected cumulative reward from following policy $\pi$ starting from state $s$ at time $h$:
$$V^\pi_h(s) = \mathbb{E}\Big[\textstyle\sum_{h'=h}^{H} r_{h'}(s_{h'}, \pi(s_{h'})) \,\Big|\, s_h = s\Big], \qquad V^\pi_{H+1}(s) = 0,$$
while the action-value function $Q^\pi_h : S \times A \to \mathbb{R}$ evaluates the value of taking action $a$ in state $s$ at time $h$:
$$Q^\pi_h(s, a) = r_h(s, a) + \mathbb{E}\Big[\textstyle\sum_{h'=h+1}^{H} r_{h'}(s_{h'}, \pi(s_{h'})) \,\Big|\, s_h = s,\ a_h = a\Big], \qquad Q^\pi_H(s, a) = r_H(s, a).$$
The Bellman equations for $V^\pi_h$ and $Q^\pi_h$ can be expressed as $V^\pi_h(s) = Q^\pi_h(s, \pi_h(s))$ and $Q^\pi_h(s, a) = r_h(s, a) + [P V^\pi_{h+1}](s, a)$, where $[P V_{h+1}](s, a) := \sum_{s'} P(s'|s, a) V_{h+1}(s')$.1 The Bellman optimality equations for the optimal value function and action-value function are as follows:
$$V^*_h(s) = \max_{a \in A} \big\{ r_h(s, a) + [P V^*_{h+1}](s, a) \big\}, \qquad Q^*_h(s, a) = r_h(s, a) + [P V^*_{h+1}](s, a),$$
with $V^*_{H+1}(s) = 0$ and $Q^*_H(s, a) = r_H(s, a)$.
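The backward recursion defined by the Bellman optimality equations can be sketched in the tabular case. This is a minimal illustration only; the helper name and the stage-independent reward are assumptions of the sketch, not part of the paper's setup.

```python
import numpy as np

def bellman_optimality_backup(P, r, H):
    """Backward induction for a finite-horizon tabular MDP.

    P : (S, A, S) array with P[s, a, s'] the transition probability.
    r : (S, A) array of rewards (assumed stage-independent here).
    Returns Q of shape (H, S, A) and V of shape (H + 1, S).
    """
    S, A, _ = P.shape
    V = np.zeros((H + 1, S))          # V*_{H+1} = 0
    Q = np.zeros((H, S, A))
    for h in range(H - 1, -1, -1):    # stages H-1, ..., 0 (0-indexed)
        Q[h] = r + P @ V[h + 1]       # Q*_h(s,a) = r(s,a) + [P V*_{h+1}](s,a)
        V[h] = Q[h].max(axis=1)       # V*_h(s) = max_a Q*_h(s,a)
    return Q, V
```

For instance, in a two-state MDP where action 0 always moves to state 0 and only the pair (state 0, action 0) is rewarded, the optimal value of state 0 over a horizon of 3 is 3.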
The cumulative regret quantifies the performance discrepancy of an agent over episodes. Given an initial state $s_0 \sim \mu$, the regret in the $n$th episode is the value difference between the optimal policy, $V^*(s_0)$, and the agent's chosen policy, $V^{\pi_n}(s_0)$, which is based on its experience up to the beginning of the $n$th episode and applied throughout it. Accumulated over $N$ episodes, it is defined as
$$\mathrm{Regret}(N) = \sum_{n=1}^{N} \mathbb{E}_\mu\big[V^*(s_0) - V^{\pi_n}(s_0)\big].$$
The agent aims to learn a sequence of policies $(\pi_1, \ldots, \pi_N)$ that minimizes the cumulative regret. If the reward function has a linear feature representation, any additional regret from an unknown reward is a lower-order term and does not affect the overall magnitude of the regret. For clarity of presentation, we assume the agent knows the reward function and focus primarily on estimating the transition probability.

Composite MDPs. Let $\phi(\cdot) \in \mathbb{R}^p$ and $\psi(\cdot) \in \mathbb{R}^q$ be feature functions, where $p$ and $q$ can be large. Consider probability transitions $P(s'|s, a)$ that can be fully embedded in the feature space via a core matrix $M^*$:
$$P(s'|s, a) = \phi(s, a)^\top M^* \psi(s').$$
Since the feature dimensions $p$ and $q$ can be large, we need not know the exact feature functions: we can include many candidate features to span the space. What matters is learning the structure of $M^*$ from data.

1 Here $P$ is a function operator mapping a function $S \to \mathbb{R}$ to a function $S \times A \to \mathbb{R}$. In the tabular case, one can think of it as a matrix of dimension $|S||A| \times |S|$.

To capture how transition dynamics combine shared core elements with scenario-specific variations, we impose the following structured assumption on the transition matrix.

Definition 2.1 (Composite MDPs). A probability transition model $P : S \times A \to \Delta(S)$ can be fully embedded in the feature space characterized by two given feature functions $\phi(\cdot) \in \mathbb{R}^p$ and $\psi(\cdot) \in \mathbb{R}^q$, where both $p$ and $q$ can be large.
The core matrix of the transition model decomposes as
$$P(s'|s, a) = \phi(s, a)^\top (L^* + S^*)\, \psi(s'),$$
where $L^*$ is a low-rank incoherent matrix and $S^*$ is a sparse matrix.

Remark 2.2 (Distinction from linear and low-rank models). The composite MDP model generalizes both linear MDPs and low-rank MDPs. Linear MDPs assume a fixed linear transition structure with known feature maps, while low-rank MDPs drop the need for known features but restrict the transition matrix to be entirely low-rank under some feature representation. In contrast, our composite MDP assumes an unknown feature-based factorization for a low-rank core, as in the low-rank MDP, but augments it with an additional sparse component. This hybrid structure captures more complex transition dynamics and task-specific deviations than either linear or low-rank models alone, marking a clear departure from those classical assumptions.

Remark 2.3 (Distinction from classical low-rank settings). Our framework explicitly tackles high-dimensional feature spaces, in contrast to the classical low-rank MDP literature, which typically assumes moderate dimensionality (Yang & Wang, 2020; Agarwal et al., 2020). We consider regimes where the feature dimensions $p$ and $q$ are significantly larger than the number of trajectories ($p, q \gg N$) and, crucially, the shared transition core $M^*$ is not constrained to be low-rank. This departure renders existing algorithms for linear or low-rank MDPs inapplicable, since those methods rely on a strictly low-rank structure or low feature complexity. The high-dimensional composite setting therefore demands new estimation techniques that can leverage the mixed low-rank and sparse structure to achieve efficient learning.

Remark 2.4 (Relation to prior models). The composite MDP shares a structural philosophy with certain prior models, notably the linear mixture MDP frameworks studied by (Yang & Wang, 2019) and (Ayoub et al., 2020), in that transitions are factorized via feature maps.
Our model can be written in the form of (Yang & Wang, 2019) as $P(s'|s, a) = \phi(s, a)^\top \alpha(s')$ with $\alpha(s') := (L^* + S^*)\psi(s')$; this reduces to (Yang & Wang, 2019) when $S^* = 0$. Similarly, our model matches the linear factored MDP in (Ayoub et al., 2020) via a Kronecker formulation: $P(s'|s, a) = (\phi(s, a) \otimes \psi(s'))^\top \mathrm{vec}(L^* + S^*)$. While structurally related, a key difference lies in estimation. Our setting allows $p, q \gg N$, and our estimator achieves minimax-optimal error rates that do not scale with $d = \max(p, q)$. In contrast, (Yang & Wang, 2019; Ayoub et al., 2020) assume low-dimensional or identifiable parameter spaces and are not suitable for high-dimensional regimes. In addition, our structured assumption is especially crucial in transfer RL, enabling effective knowledge sharing via $L^*$ while capturing task-specific deviations via $S^*$.

3. Single-Task UCB-Q-Learning under High-Dimensional Composite MDPs

This section introduces UCB-Q-learning for composite MDPs with a high-dimensional feature space. Specifically, we consider the setting where the feature dimensions $p, q \gg N$ and the transition core $M^*$ is no longer a low-rank matrix. As a result, existing algorithms designed for linear MDPs are not applicable. Likewise, methods tailored for low-rank MDPs fail in our setting due to the absence of a low-rank structure in $M^*$. To address the challenges arising from our relaxed dimensionality constraints and the more complex MDP structure, novel algorithmic approaches are required.

For any tuple $(s_{i,h}, a_{i,h})$ from episode $i$ and stage $h$, we define $\phi_{i,h} = \phi(s_{i,h}, a_{i,h})$, $\psi_{i,h} = \psi(s_{i,h})$, and $K_\psi := \sum_{s' \in S} \psi(s')\psi(s')^\top$. Our estimator is based on the following population-level equation at each stage $h$:
$$\mathbb{E}\big[\psi_{i,h}^\top K_\psi^{-1} \,\big|\, s_{i,h}, a_{i,h}\big] = \sum_{s'} P(s'|s_{i,h}, a_{i,h})\,\psi(s')^\top K_\psi^{-1} = \sum_{s'} \phi_{i,h}^\top (L^* + S^*)\,\psi(s')\psi(s')^\top K_\psi^{-1} = \phi_{i,h}^\top (L^* + S^*). \quad (1)$$
This motivates us to use the sample-level counterpart of (1) to estimate $L^*$ and $S^*$ in a composite MDP. However, both $L^*$ and $S^*$ are unknown.
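As a small numerical sanity check of identity (1), one can verify that weighting next-state features $\psi(s')^\top K_\psi^{-1}$ by the transition row recovers $\phi^\top(L^* + S^*)$. In this sketch the sizes, the single sparse entry, and the one-hot next-state features (so that $K_\psi = I$) are all illustrative assumptions; the "transition row" is left unnormalized, which is harmless because the identity is linear in $P$.

```python
import numpy as np

rng = np.random.default_rng(0)
p, q, r = 6, 5, 2                       # illustrative feature dims and rank

# Low-rank component L* = U V^T (rescaled) plus a single-entry sparse S*
U, V = rng.normal(size=(p, r)), rng.normal(size=(q, r))
L_star = U @ V.T / np.sqrt(p * q)
S_star = np.zeros((p, q))
S_star[0, 0] = 0.3
M_star = L_star + S_star

# One-hot next-state features: Psi = I_q, hence K_psi = sum_s' psi psi^T = I_q
Psi = np.eye(q)
K_psi = Psi.T @ Psi

phi = rng.normal(size=p)                # feature of a fixed (s, a) pair
probs = phi @ M_star @ Psi.T            # "P(s'|s,a)" row (unnormalized illustration)

# Identity (1): sum_s' P(s'|s,a) psi(s')^T K^{-1} = phi^T (L* + S*)
lhs = sum(probs[j] * Psi[j] @ np.linalg.inv(K_psi) for j in range(q))
rhs = phi @ M_star
assert np.allclose(lhs, rhs)
```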
To recover the low-rank and sparse components, additional assumptions are required to ensure that the low-rank part can be separated from the sparse component. Below, we elaborate on the incoherence assumption and the sufficient-sparsity condition.

Assumption 3.1. Let $L^* = U^* \Sigma^* V^{*\top}$ be the singular value decomposition (SVD) of $L^*$. Let $\mu$ be the incoherence parameter (a constant $> 1$) and $r$ the rank of $L^*$. We assume that:
(i) (Incoherence.) $\|U^*\|_{2,\infty} \le \sqrt{\mu r / p}$ and $\|V^*\|_{2,\infty} \le \sqrt{\mu r / q}$, where the 2-to-infinity norm $\|\cdot\|_{2,\infty}$ denotes the maximum $\ell_2$-norm of the rows of a matrix.
(ii) (Sufficient sparsity.) The matrix $S^*$ contains at most $s$ nonzero entries, where $s \le \bar{s} := \max\{p, q\}/(4 C_S \mu r^3)$, for some constant $C_S$.

Remark 3.2 (Incoherence condition). The incoherence condition ensures that the singular vectors of a low-rank matrix are not overly concentrated in any single direction or entry, a property that is crucial for matrix completion (Candès & Recht, 2012). In our setting, it also facilitates the separation of the sparse component from the low-rank matrix. When $r$ and $\mu$ are treated as constants, the maximum permissible sparsity level scales linearly with $p$. Moreover, as shown in (Candès & Tao, 2010), the incoherence condition holds for a broad class of random matrices.

We consider the online learning setting and propose to estimate $L^*$ and $S^*$ in the composite MDP by optimizing the following hard-constrained least-squares objective for each episode $n \in [N]$, with collected tuples $(s_{i,h}, a_{i,h})$ from previous episodes $i \in [n]$ and stages $h \in [H]$:
$$(\hat{L}_n, \hat{S}_n) \in \arg\min_{L, S \in \mathbb{R}^{p \times q}} \sum_{i \in [n],\, h \in [H]} \big\| \psi_{i,h}^\top K_\psi^{-1} - \phi_{i,h}^\top (L + S) \big\|_2^2 \quad \text{s.t. } L = U \Sigma V^\top,\ \|U\|_{2,\infty} \le \sqrt{\mu r / p},\ \|V\|_{2,\infty} \le \sqrt{\mu r / q}. \quad (2)$$

Remark 3.3 (Computational Efficiency of $K_\psi$). When $|S|$ is large or infinite, we can compute $K_\psi$ using a Monte Carlo approximation: $\hat{K}_\psi = \frac{1}{m}\sum_{i=1}^{m} \psi(s_i)\psi(s_i)^\top$ with $s_i \sim \mathrm{Unif}(S)$. This is a standard approach in randomized numerical linear algebra (Drineas & Mahoney, 2016) and is computed only once before online learning.
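The Monte Carlo approximation of Remark 3.3 can be sketched as follows; the helper name `monte_carlo_K_psi` and the uniform state sampler are illustrative assumptions of this sketch.

```python
import numpy as np

def monte_carlo_K_psi(psi, state_sampler, m, rng):
    """Monte Carlo approximation of E_{s' ~ Unif(S)}[psi(s') psi(s')^T],
    i.e. K_psi up to the |S| scaling, as in Remark 3.3:
        K_hat = (1/m) * sum_i psi(s_i) psi(s_i)^T,  s_i ~ Unif(S).
    """
    q = len(psi(state_sampler(rng)))
    K_hat = np.zeros((q, q))
    for _ in range(m):
        v = psi(state_sampler(rng))
        K_hat += np.outer(v, v)
    return K_hat / m
```

For one-hot features over four states, the exact average is $I_4 / 4$, and the Monte Carlo estimate converges to it as $m$ grows.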
Our method is thus comparable in efficiency to (Yang & Wang, 2019), which stores empirical covariances, but we use $K_\psi$ to build confidence regions specific to our model.

3.1. UCB-Q-Learning for High-Dimensional Composite MDPs

Since the transition dynamics $P$ are typically unknown, we must leverage observed data to approximate the underlying model parameters. To balance the exploration-exploitation trade-off, we adopt the optimism-in-the-face-of-uncertainty principle and employ an Upper Confidence Bound (UCB)-based algorithm. We begin by constructing the confidence region
$$B_n = \big\{ (L, S) \,\big|\, \|L - \hat{L}_n\|_F^2 + \|S - \hat{S}_n\|_F^2 \le \beta_n \big\}, \quad (3)$$
where $\hat{L}_n$ and $\hat{S}_n$ are estimated by (2),
$$\beta_n = \frac{C_\beta H \log(d N H)}{n} \big[ r\, (C_\phi C'_\psi)^2 + s\, C_{\phi\psi}^2 \big], \quad (4)$$
$d = \max\{p, q\}$, the quantities $C_\phi, C_\psi, C'_\psi, C_{\phi\psi}$ are positive parameters defined in the regularity Assumption 3.4, and $C_\beta$ is a universal constant. The optimistic value functions are given by
$$Q_{n,h}(s, a) = r(s, a) + \max_{(L,S) \in B_n} \phi(s, a)^\top (L + S)\, \Psi^\top V_{n,h+1}, \qquad Q_{n,H+1}(s, a) = 0, \quad (5)$$
where $V_{n,h}(s) = \Pi_{[0,H]}[\max_a Q_{n,h}(s, a)]$, with $\Pi_{[0,H]}$ truncating values to $[0, H]$. Here, $\Psi \in \mathbb{R}^{|S| \times q}$ is the feature matrix whose rows are the $q$-dimensional feature vectors of the states in $S$.

Algorithm 1 UCB-Q-Learning for HD Composite MDPs
Input: total number of episodes $N$, feature functions $\phi \in \mathbb{R}^p$, $\psi \in \mathbb{R}^q$.
for episode $n = 1, 2, \ldots, N$ do
  Construct the confidence region $B_n$ in (3).
  Calculate $Q_{n,h}(s, a)$ in (5).
  for stage $h = 1, 2, \ldots, H$ do
    Take action $a_{n,h} = \arg\max_{a \in A} Q_{n,h}(s_{n,h}, a)$.
    Observe next state $s_{n,h+1}$.
  end for
  Learn transition core estimators $\hat{L}_n, \hat{S}_n$ using (2).
end for

The algorithm is summarized in Algorithm 1.

3.2. Regret Analysis for UCB-Q-Learning under High-Dimensional Composite MDPs

For the regret analysis, we impose certain regularity conditions on the features, as outlined below.

Assumption 3.4. Let $\Psi$ be the matrix with rows $\psi(s)^\top$.
Let $C_\phi, C'_\phi, C_\psi, C'_\psi, C_{\phi\psi}$ be positive parameters such that:
(i) for all $(s, a)$: $\|\phi(s, a)\|_2 \le C_\phi$ and $\|\phi(s, a)\|_\infty \le C'_\phi$;
(ii) $\|\Psi\|_{2,\infty} \le C_\psi$;
(iii) for all $s'$: $\|\psi(s')^\top K_\psi^{-1}\|_2 \le C'_\psi$;
(iv) for all $(s, a, s')$: $\|\phi(s, a)\,\psi(s')^\top K_\psi^{-1}\|_{\max} \le C_{\phi\psi}$.

Lemma 3.5 (Transition Estimation Error). For composite MDPs in Definition 2.1, under Assumptions 3.1 and 3.4, the estimator obtained by solving program (2) at the end of the $n$th episode satisfies, with probability at least $1 - 1/(n^2 H)$,
$$\|\hat{L}_n - L^*\|_F^2 + \|\hat{S}_n - S^*\|_F^2 \le \beta_n,$$
where $\beta_n$ is defined in (4).

Remark 3.6. Our method handles the high-dimensional setting by explicitly exploiting the low-rank-plus-sparse structure of the transition core matrix $M^* = L^* + S^*$, where $L^*$ has low rank $r \ll d$ and $S^*$ has sparsity $s \ll d$. This structure enables consistent estimation in regimes where $p, q \gg N$, and it is key to our theoretical guarantees. This error bound is minimax optimal with respect to $n$.

Theorem 3.7 (Single-Task Regret Upper Bound). For composite MDPs in Definition 2.1, under Assumptions 3.1 and 3.4, let $\mathrm{Regret}(NH)$ be the accumulated regret over a total of $N$ episodes of the UCB-Q-Learning in Algorithm 1. We have that
$$\mathrm{Regret}(NH) \lesssim C_\phi C_\psi H^2 \sum_{n=1}^{N} \sqrt{\beta_n} + 1 \lesssim C_{\mathrm{reg}}\, H^{5/2} \sqrt{N},$$
where $d = \max\{p, q\}$ and $C_{\mathrm{reg}} := C_\phi C_\psi \sqrt{C_\beta \big[ r\,(C_\phi C'_\psi)^2 + s\, C_{\phi\psi}^2 \big] \log(d N H)}$.

Remark 3.8. This regret bound achieves optimal scaling with respect to both the number of trajectories $N$ and the ambient dimension $d$, matching previous results in reinforcement learning (Yang & Wang, 2020; Jin et al., 2020). In Section 4, we demonstrate that transfer learning can substantially reduce both the dependence on the ambient dimension $d$ and the scaling with $N$ by effectively utilizing additional trajectories from a source task.

3.3. Challenge and Proof Sketch under the Composite Structure

By the optimality condition of (2), the estimation error decomposes into a martingale-noise term and a structural term that must be controlled jointly for the low-rank and sparse parts; the full argument is given in Appendix A.

Notation. We write $a_n \lesssim b_n$ if there exists a constant $C > 0$ such that $a_n \le C b_n$; $\tilde{O}(\cdot)$ is similarly defined, neglecting logarithmic factors. Constants $c, C, c_0$ may vary from line to line.

A.
Regret Analysis of the Single-Task UCB-Q-Learning with Composite MDP Structures

We present below the proof of Theorem 3.7. The proof of Lemma 3.5 emerges as an intermediate step along the way.

Proof of Theorem 3.7. We first sketch the proof as follows. Define the good event $E_n$ that the ground-truth transition core lies in the confidence region before episode $n$, i.e., $(L^*, S^*) \in B_{n'}$ for all $n' \le n - 1$, and denote $\mathcal{E}_n = \mathbb{1}_{E_n}$. We first argue under $E_n$ and later use concentration to prove that $E_n$ holds with high probability.

1. Under $E_n$, prove $Q_{n,h} \ge Q^*_h$ using induction.
2. Bound $Q_{n,h}(s_{n,h}, a_{n,h}) - \big[ r(s_{n,h}, a_{n,h}) + P(\cdot|s_{n,h}, a_{n,h})^\top V_{n,h+1} \big]$.
3. Bound the total regret by the one-step errors derived in Step 2.

We elaborate each step in the sequel.

A.1. Upper confidence bound

Lemma A.1. Given any state-action pair $(s, a) \in S \times A$, for each episode $n$ and decision step $h$, we have $Q_{n,h}(s, a) \ge Q^*_h(s, a)$.

Proof. We use induction. At $h = H$, we have $Q_{n,H}(s, a) = Q^*_H(s, a) = r(s, a)$. Assuming the claim holds at stage $h + 1$, so that $V_{n,h+1}(s) \ge V^*_{h+1}(s)$, it extends to stage $h$:
$$Q_{n,h}(s, a) = r(s, a) + \max_{(L,S) \in B_n} \phi(s, a)^\top (L + S)\, \Psi^\top V_{n,h+1} \ \ge\ r(s, a) + \phi(s, a)^\top (L^* + S^*)\, \Psi^\top V_{n,h+1} \ \ge\ r(s, a) + [P V^*_{h+1}](s, a) = Q^*_h(s, a).$$

A.2. One-step bound

Lemma A.2. For any $h \in [H]$ and $n \in [N]$, under $E_n$ we have
$$Q_{n,h}(s_{n,h}, a_{n,h}) - \big( r(s_{n,h}, a_{n,h}) + [P_h V_{n,h+1}](s_{n,h}, a_{n,h}) \big) \ \lesssim\ C_\phi C_\psi H \sqrt{\beta_n}.$$

Proof. Let $(\tilde{L}, \tilde{S}) = \arg\max_{(L,S) \in B_n} \phi(s, a)^\top (L + S)\, \Psi^\top V_{n,h+1}$. It holds that
$$Q_{n,h}(s_{n,h}, a_{n,h}) - \big( r(s_{n,h}, a_{n,h}) + [P_h V_{n,h+1}](s_{n,h}, a_{n,h}) \big) = \phi_{n,h}^\top \big[ \tilde{L} + \tilde{S} - M^* \big] \Psi^\top V_{n,h+1} = \phi_{n,h}^\top \big[ (\tilde{L} - L^*) + (\tilde{S} - S^*) \big] \Psi^\top V_{n,h+1}.$$
Applying Hölder's inequality, the triangle inequality, and the Cauchy-Schwarz inequality, we deduce
$$\begin{aligned}
&Q_{n,h}(s_{n,h}, a_{n,h}) - \big( r(s_{n,h}, a_{n,h}) + [P_h V_{n,h+1}](s_{n,h}, a_{n,h}) \big) \\
&\quad \le \|\phi_{n,h}^\top (\tilde{L} - L^*)\|_2\, \|\Psi^\top V_{n,h+1}\|_2 + \|\phi_{n,h}^\top (\tilde{S} - S^*)\|_2\, \|\Psi^\top V_{n,h+1}\|_2 \\
&\quad \le H C_\psi\, \|\phi_{n,h}^\top (\tilde{L} - L^*)\|_2 + H C_\psi\, \|\phi_{n,h}^\top (\tilde{S} - S^*)\|_2 \\
&\quad \le H C_\psi \|\phi_{n,h}\|_2\, \|\tilde{L} - L^*\|_F + H C_\psi \|\phi_{n,h}\|_2\, \|\tilde{S} - S^*\|_F \\
&\quad \le C_\phi C_\psi H \big( \|\tilde{L} - L^*\|_F + \|\tilde{S} - S^*\|_F \big) \ \lesssim\ C_\phi C_\psi H \sqrt{\beta_n},
\end{aligned}$$
where the last step uses that both $(\tilde{L}, \tilde{S})$ and $(L^*, S^*)$ lie in $B_n$.

A.3.
Regret decomposition

We bound the regret by the sum of one-step errors. To set the stage, let $F_{n,h}$ be the $\sigma$-field generated by all the random variables up until episode $n$, step $h$, essentially fixing the sequence $s_{1,1}, a_{1,1}, s_{1,2}, a_{1,2}, \ldots, s_{n,h}, a_{n,h}$. To proceed, let
$$\delta_{n,h} := (V_{n,h} - V^{\pi_n}_h)(s_{n,h}), \qquad \gamma_{n,h} := Q_{n,h}(s_{n,h}, a_{n,h}) - \big( r(s_{n,h}, a_{n,h}) + [P_h V_{n,h+1}](s_{n,h}, a_{n,h}) \big).$$
Then
$$\mathrm{Regret}(NH) = \mathbb{E} \sum_{n=1}^{N} \big[ V^*(s_0) - V^{\pi_n}(s_0) \big] \le \mathbb{E} \sum_{n=1}^{N} (V_{n,1} - V^{\pi_n}_1)(s_0) = \sum_{n=1}^{N} \mathbb{E}(\delta_{n,1}).$$
We have $\mathbb{E}(\delta_{n,1}) = \mathbb{E}\big(\delta_{n,1}\mathcal{E}_n + (1 - \mathcal{E}_n)\delta_{n,1}\big) \le \mathbb{E}(\delta_{n,1}\mathcal{E}_n) + H\,\mathbb{P}[\mathcal{E}_n = 0]$, and
$$\begin{aligned}
\mathbb{E}(\delta_{n,1}\mathcal{E}_n \mid F_{n,1}) &= (V_{n,1} - V^{\pi_n}_1)(s_{n,1})\,\mathcal{E}_n \le Q_{n,1}(s_{n,1}, a_{n,1}) - V^{\pi_n}_1(s_{n,1}) \\
&= \gamma_{n,1} + \big( r(s_{n,1}, a_{n,1}) + [P V_{n,2}](s_{n,1}, a_{n,1}) \big) - V^{\pi_n}_1(s_{n,1}) \\
&\le \gamma_{n,1} + \mathbb{E}\big( (V_{n,2} - V^{\pi_n}_2)(s_{n,2})\,\mathcal{E}_n \mid F_{n,1} \big) \le \cdots \le \sum_{h=1}^{H} \mathbb{E}(\gamma_{n,h} \mid F_{n,1}).
\end{aligned}$$
Therefore, we establish the regret bound with high probability that
$$\mathrm{Regret}(NH) \lesssim C_\phi C_\psi H^2 \sum_{n=1}^{N} \sqrt{\beta_n} + NH\,\mathbb{P}[\mathcal{E}_N = 0], \quad (19)$$
where we use $E_N \subseteq E_n$ for $n \le N$.

A.4. Confidence region

In this subsection, we validate the confidence region; recall the CR is defined in (3). Denote
$$X_n = \begin{pmatrix} \phi_{1,1}^\top \\ \vdots \\ \phi_{1,H}^\top \\ \vdots \\ \phi_{n-1,H}^\top \end{pmatrix}, \qquad Y_n = \begin{pmatrix} \psi_{1,1}^\top K_\psi^{-1} \\ \vdots \\ \psi_{1,H}^\top K_\psi^{-1} \\ \vdots \\ \psi_{n-1,H}^\top K_\psi^{-1} \end{pmatrix};$$
we have the observational model $Y_n = X_n(L^* + S^*) + W_n$, where $W_n = Y_n - X_n(L^* + S^*)$. In view of (Chai & Fan, 2024), we need the observation model to satisfy the restricted strong convexity condition in order for the low-rank part and the sparse part to be separated. We will later give specific cases in which this inequality holds.

In the sequel, we bound $\|X_n^\top W_n\|_2$ and $\|X_n^\top W_n\|_{\max}$. In fact, we can express
$$X_n^\top W_n = \sum_{i,h} \phi_{i,h} \big( \psi_{i,h}^\top K_\psi^{-1} - \phi_{i,h}^\top M^* \big).$$
Note that $\mathbb{E}\big[ \psi_{i,h}^\top K_\psi^{-1} - \phi_{i,h}^\top M^* \,\big|\, F_{i,h} \big] = 0$, so $X_n^\top W_n$ is a sum of martingale differences. Let $Z_{i,h} = \phi_{i,h}\big( \psi_{i,h}^\top K_\psi^{-1} - \phi_{i,h}^\top M^* \big)$; we have
$$\|Z_{i,h}\|_2 \le \|\phi_{i,h}\|_2 \big( \|\psi_{i,h}^\top K_\psi^{-1}\|_2 + \|\phi_{i,h}^\top M^*\|_2 \big) \le 2 C_\phi C'_\psi,$$
where we used $\|\phi_{i,h}^\top M^*\|_2 = \|\mathbb{E}[\psi_{i,h}^\top K_\psi^{-1} \mid F_{i,h}]\|_2 \le C'_\psi$. On the other hand,
$$\Big\| \sum_{i,h} \mathbb{E}[Z_{i,h}^\top Z_{i,h} \mid F_{i,h}] \Big\|_2 \le 4 n H (C_\phi C'_\psi)^2, \qquad \Big\| \sum_{i,h} \mathbb{E}[Z_{i,h} Z_{i,h}^\top \mid F_{i,h}] \Big\|_2 \le 4 n H (C_\phi C'_\psi)^2.$$
By the matrix Freedman inequality (Corollary 1.3 in (Tropp, 2011)), we have with probability at least $1 - \delta$ that
$$\|X_n^\top W_n\|_2 \lesssim C_\phi C'_\psi \sqrt{n H \log(d/\delta)}.$$
Similarly, each entry of $X_n^\top W_n$ is a sum of martingale differences, almost surely bounded by $2 C_{\phi\psi}$, and we have by the Azuma-Hoeffding inequality and a union bound over the $d^2$ entries that, with probability at least $1 - \delta$,
$$\|X_n^\top W_n\|_{\max} = \max_{i,j} \big| [X_n^\top W_n]_{ij} \big| \lesssim C_{\phi\psi} \sqrt{n H \log(d/\delta)}.$$

Confidence Region. By the optimality condition, we have $\|Y_n - X_n(\hat{L}_n + \hat{S}_n)\|_F^2 \le \|Y_n - X_n(L^* + S^*)\|_F^2$. Expanding on both sides yields
$$\begin{aligned}
\|X_n(\Delta_L + \Delta_S)\|_F^2 &\le 2 \langle W_n,\, X_n(\Delta_L + \Delta_S) \rangle = 2 \langle X_n^\top W_n,\, \Delta_L + \Delta_S \rangle \\
&\le 2 \|X_n^\top W_n\|_2\, \|\Delta_L\|_* + 2 \|X_n^\top W_n\|_{\max}\, \|\Delta_S\|_1 \\
&\le 2\sqrt{2r}\, \|X_n^\top W_n\|_2\, \|\Delta_L\|_F + 2\sqrt{2s}\, \|X_n^\top W_n\|_{\max}\, \|\Delta_S\|_F,
\end{aligned}$$
where we denote $\Delta_L := \hat{L}_n - L^*$ and $\Delta_S := \hat{S}_n - S^*$. On the other hand, by (20) and the separation lemma in (Chai & Fan, 2024), we have that
$$\|X_n(\Delta_L + \Delta_S)\|_F^2 \ge c_1 (n - 1)\, \|\Delta_L + \Delta_S\|_F^2 \ge \frac{c_1 (n - 1)}{2} \big( \|\Delta_L\|_F^2 + \|\Delta_S\|_F^2 \big).$$
Putting these together, we have
$$\|\Delta_L\|_F^2 + \|\Delta_S\|_F^2 \le \frac{4}{c_1 (n-1)} \Big( \sqrt{2r}\, \|X_n^\top W_n\|_2\, \|\Delta_L\|_F + \sqrt{2s}\, \|X_n^\top W_n\|_{\max}\, \|\Delta_S\|_F \Big) \le \frac{4}{c_1 (n-1)} \sqrt{2\big( r \|X_n^\top W_n\|_2^2 + s \|X_n^\top W_n\|_{\max}^2 \big)} \sqrt{\|\Delta_L\|_F^2 + \|\Delta_S\|_F^2},$$
which implies
$$\|\Delta_L\|_F^2 + \|\Delta_S\|_F^2 \le \frac{32}{c_1^2 (n-1)^2} \big( r\, \|X_n^\top W_n\|_2^2 + s\, \|X_n^\top W_n\|_{\max}^2 \big).$$
Plugging in the aforementioned bounds on $\|X_n^\top W_n\|_2$ and $\|X_n^\top W_n\|_{\max}$, we deduce that $B_n$ is a valid $\delta$-confidence region if we take
$$\beta_n \asymp \frac{1}{n^2} \big( r\,(C_\phi C'_\psi)^2\, n H \log(d/\delta) + s\, C_{\phi\psi}^2\, n H \log(d/\delta) \big) = \frac{C_\beta H \log(d/\delta)}{n} \big( r\,(C_\phi C'_\psi)^2 + s\, C_{\phi\psi}^2 \big) \quad (21)$$
for some large enough $C_\beta$. In particular, we take $\delta = 1/(N^2 H)$ in the above display; then by the union bound,
$$\mathbb{P}(\mathcal{E}_N = 0) \le N \cdot \frac{1}{N^2 H} = \frac{1}{N H}.$$
Combining the definition of $\beta_n$ with (19), we obtain that
$$\mathrm{Regret}(NH) \lesssim C_\phi C_\psi H^2 \sum_{n=1}^{N} \sqrt{\beta_n} + 1 \lesssim C_{\mathrm{reg}}\, H^{5/2} \sqrt{N},$$
where $C_{\mathrm{reg}} := C_\phi C_\psi \sqrt{C_\beta \big[ r\,(C_\phi C'_\psi)^2 + s\, C_{\phi\psi}^2 \big] \log(d N H)}$. As a byproduct, we have that by the end of the $N$th episode, with probability at least $1 - 1/(N^2 H)$,
$$\|\hat{L}_N - L^*\|_F^2 + \|\hat{S}_N - S^*\|_F^2 \le \beta_N.$$

A.5. Discussion on Condition (20)

Now we provide an example where Condition (20) holds. At a high level, the $\phi_{i,h}$ may be highly correlated across different steps and episodes.
Nonetheless, note that each episode starts at an independent initial state, hence providing diversity to the linear operator $X_n$. In fact, Condition (20) holds when $\phi(s, a)$ depends mainly on $s$ and $a$ adds a perturbation effect. To be more concrete, consider the following lemma as an example.

Lemma A.3. Suppose there exists some function $\bar\phi$ such that $\|\phi(s, a) - \bar\phi(s)\|_2 \le \eta$, and $\lambda_{\min}\big(\mathbb{E}_\mu[\bar\phi(s)\bar\phi(s)^\top]\big) \ge c_{\min}$. If $\eta(2 C_\phi + \eta) \le c_{\min}/4$ and $n \ge C_e d$ for some constant $C_e$, then with probability at least $1 - 2e^{-c_e n}$, Condition (20) holds with $c_1 = c_{\min}/4$.

Proof. Denote $\mathbb{E}_\mu[\bar\phi(s)\bar\phi(s)^\top]$ by $\Sigma_\mu$. As $|\bar\phi(s)^\top v| \le \|\bar\phi(s)\|_2 \le C_\phi$ for any unit vector $v$, $\bar\phi(s)$ is sub-Gaussian with variance proxy $(C_\phi/2)^2$. By (5.25) in (Vershynin, 2010), there exist constants $C_e, c_e$ such that with probability at least $1 - 2e^{-c_e n}$,
$$\Big\| \frac{1}{n-1} \sum_{i=1}^{n-1} \bar\phi(s_{i,1})\bar\phi(s_{i,1})^\top - \Sigma_\mu \Big\|_2 \le \frac{c_{\min}}{2},$$
as long as $n \ge C_e d$. Let $\Delta = \frac{1}{n-1}\sum_{i=1}^{n-1} \bar\phi(s_{i,1})\bar\phi(s_{i,1})^\top - \frac{1}{n-1}\sum_{i=1}^{n-1} \phi_{i,1}\phi_{i,1}^\top$. We have for any unit vector $v$ that
$$|v^\top \Delta v| = \Big| \frac{1}{n-1} \sum_{i=1}^{n-1} \big( [\bar\phi(s_{i,1})^\top v]^2 - [\phi_{i,1}^\top v]^2 \big) \Big| \le \max_i \|\bar\phi(s_{i,1}) - \phi_{i,1}\|_2\, \|\bar\phi(s_{i,1}) + \phi_{i,1}\|_2 \le \eta(2 C_\phi + \eta).$$
Hence $\|\Delta\|_2 \le \eta(2 C_\phi + \eta) \le c_{\min}/4$. It follows that
$$\lambda_{\min}\Big( \frac{1}{n-1} \sum_{i=1}^{n-1} \phi_{i,1}\phi_{i,1}^\top \Big) \ge \frac{c_{\min}}{2} - \frac{c_{\min}}{4} = \frac{c_{\min}}{4}.$$
To conclude, note that
$$\frac{1}{n-1} \sum_{i=1}^{n-1} \sum_{h=1}^{H} \phi_{i,h}\phi_{i,h}^\top \succeq \frac{1}{n-1} \sum_{i=1}^{n-1} \phi_{i,1}\phi_{i,1}^\top \succeq \frac{c_{\min}}{4}\, I.$$

Remark A.4. The condition $n \ge C_e d$ requires a warm start. For $n \le C_e d$, one can use a fixed policy to generate samples. This will not affect the total regret as long as $d H$ is negligible compared to the leading term of the regret.

B. Regret Analysis for UCB-TQL under Composite MDPs

In this section, we provide the proof of Theorem 4.6. The pipeline is similar to the single-task setting. We start by constructing the confidence region. To that end, denote
$$X^{(1)}_n = \begin{pmatrix} \phi^{(1)\top}_{1,1} \\ \vdots \\ \phi^{(1)\top}_{1,H} \\ \vdots \\ \phi^{(1)\top}_{n-1,H} \end{pmatrix}, \qquad Y^{(1)}_n = \begin{pmatrix} \psi^{(1)\top}_{1,1} K_\psi^{-1} \\ \vdots \\ \psi^{(1)\top}_{1,H} K_\psi^{-1} \\ \vdots \\ \psi^{(1)\top}_{n-1,H} K_\psi^{-1} \end{pmatrix};$$
we have the observational model $Y^{(1)}_n = X^{(1)}_n (L^* + S^{*(0)} + D^*) + W_n$, where $W_n := Y^{(1)}_n - X^{(1)}_n (L^* + S^{*(0)} + D^*)$. By the optimality condition, we have
$$\big\| Y^{(1)}_n - X^{(1)}_n (\hat{L} + \hat{S}^{(0)} + \hat{D}_n) \big\|_F^2 \le \big\| Y^{(1)}_n - X^{(1)}_n (L^* + S^{*(0)} + D^*) \big\|_F^2.$$
After some calculation, we obtain
$$\begin{aligned}
\|X^{(1)}_n (\hat{D}_n - D^*)\|_F^2 &\le 2 \big\langle Y^{(1)}_n - X^{(1)}_n (\hat{L} + \hat{S}^{(0)} + D^*),\ X^{(1)}_n (\hat{D}_n - D^*) \big\rangle \\
&= 2 \big\langle X^{(1)\top}_n W_n + X^{(1)\top}_n X^{(1)}_n \big( L^* - \hat{L} + S^{*(0)} - \hat{S}^{(0)} \big),\ \hat{D}_n - D^* \big\rangle \\
&\le 2 \|X^{(1)\top}_n W_n\|_{\max}\, \|\hat{D}_n - D^*\|_1 + 2 \big\| X^{(1)\top}_n X^{(1)}_n \big( L^* - \hat{L} + S^{*(0)} - \hat{S}^{(0)} \big) \big\|_F\, \|\hat{D}_n - D^*\|_F \\
&\le 2\sqrt{2e}\, \|X^{(1)\top}_n W_n\|_{\max}\, \|\hat{D}_n - D^*\|_F + 2 \big\| X^{(1)\top}_n X^{(1)}_n \big( L^* - \hat{L} + S^{*(0)} - \hat{S}^{(0)} \big) \big\|_F\, \|\hat{D}_n - D^*\|_F.
\end{aligned}$$
We have, similarly as before, with probability at least $1 - \delta$, $\|X^{(1)\top}_n W_n\|_{\max} \lesssim C_{\phi\psi}\sqrt{nH \log(d/\delta)}$ and $\|X^{(1)\top}_n X^{(1)}_n\|_2 \lesssim n - 1$; it then holds that
$$\|\hat{D}_n - D^*\|_F^2 \lesssim \frac{e\, C_{\phi\psi}^2\, H \log d}{n} + \|L^* - \hat{L}\|_F^2 + \|S^{*(0)} - \hat{S}^{(0)}\|_F^2.$$
Suppose at the first stage we established
$$\|L^* - \hat{L}\|_F^2 + \|S^{*(0)} - \hat{S}^{(0)}\|_F^2 \le \beta_{N_0}$$
with probability at least $1 - 1/(2N^2 H)$, where $\beta_{N_0} = \frac{C_\beta H \log(d/\delta)}{N_0} \big[ r\,(C_\phi C'_\psi)^2 + s\, C_{\phi\psi}^2 \big]$, as in (21).

Naive CR. Recall that we can construct a naive confidence region as
$$B_n = \big\{ (L, S) \,\big|\, \|L - \hat{L}_n\|_F^2 + \|S - \hat{S}_n\|_F^2 \le \beta^{(1)}_n \big\},$$
where $\beta^{(1)}_n := \beta_{N_0} + \frac{e\, C_{\phi\psi}^2\, H \log(d N H)}{n}$. When using this confidence region to construct optimistic value functions, we can plug the value of $\beta^{(1)}_n$ into (19). It follows that
$$\mathrm{Regret}(NH) \lesssim C_\phi C_\psi H^2 \Big( N \sqrt{\beta_{N_0}} + \sqrt{e\, C_{\phi\psi}^2\, N H \log(d N H)} \Big).$$

Remark B.1. When $N_0 \gg N$, this bound is dominated by the second term, which depends on $C_\phi$.

Tight CR. As illustrated in the proof sketch, we can achieve a better rate by constructing a more fine-grained confidence region:
$$\tilde{B}_n = \Big\{ (L, S, D) \,\Big|\, \|L - \hat{L}^{(0)}\|_F^2 + \|S - D - \hat{S}^{(0)}\|_F^2 \le \beta_{N_0},\ \|D - \hat{D}_n\|_F^2 \le \beta^{(1)}_n,\ \|D\|_0 \le e \Big\}.$$
It is straightforward to show that $(L^*, S^{*(0)}, D^*) \in \tilde{B}_n$. We then carry out a more refined one-step analysis, similar in vein to Section 3.2.2.
In particular, let $(\tilde{L}, \tilde{S}, \tilde{D}) = \arg\max_{(L,S,D) \in \tilde{B}_n} \phi(s, a)^\top (L + S)\, \Psi^\top V_{n,h+1}$; it holds that
$$\begin{aligned}
&Q_{n,h}(s_{n,h}, a_{n,h}) - \big( r(s_{n,h}, a_{n,h}) + [P_h V_{n,h+1}](s_{n,h}, a_{n,h}) \big) \\
&\quad \le \|\phi_{n,h}^\top (\tilde{L} - L^*)\|_2\, \|\Psi^\top V_{n,h+1}\|_2 + \|\phi_{n,h}^\top (\tilde{S} - \tilde{D} - S^* + D^*)\|_2\, \|\Psi^\top V_{n,h+1}\|_2 + \big| \phi_{n,h}^\top (\tilde{D} - D^*)\, \Psi^\top V_{n,h+1} \big| \\
&\quad \le H C_\psi\, \|\phi_{n,h}^\top (\tilde{L} - L^*)\|_2 + H C_\psi\, \|\phi_{n,h}^\top (\tilde{S} - \tilde{D} - S^* + D^*)\|_2 + \sqrt{2e}\, H C'_\phi C_\psi\, \|\tilde{D} - D^*\|_F \\
&\quad \le C_\phi C_\psi H \big( \|\tilde{L} - L^*\|_F + \|\tilde{S} - \tilde{D} - S^* + D^*\|_F \big) + \sqrt{2e}\, H C'_\phi C_\psi\, \|\tilde{D} - D^*\|_F \\
&\quad \lesssim C_\phi C_\psi H \sqrt{\beta_{N_0}} + C'_\phi C_\psi H \sqrt{\beta^{(1)}_n\, e}, \quad (22)
\end{aligned}$$
where the sparsity constraint on $D$ is used to bound the third term in the second line. Combined with the one-step error in the regret decomposition (19), we obtain that
$$\mathrm{Regret}(NH) \lesssim C_\phi C_\psi H^2 N \sqrt{\beta_{N_0}} + C'_\phi C_\psi H^2 \sum_{n=1}^{N} \sqrt{\beta^{(1)}_n\, e} \ \lesssim\ \big( C_\phi + C'_\phi \sqrt{e} \big) C_\psi H^2 N \sqrt{\beta_{N_0}} + C'_\phi C_\psi H^2 \sqrt{e\, C_{\phi\psi}^2\, N H \log(d N H)}.$$
Plugging in the definition of $\beta_{N_0}$ completes the proof.

Remark B.2. When $N_0 \gg N$, this bound is dominated by the second term, which does not depend on $C_\phi$ but rather on $C'_\phi$. It is tighter than the rate of the naive-CR approach, since $C_\phi \ge C'_\phi$.
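The two-stage pipeline behind this analysis, estimate the shared core on the source task, then fit only the sparse difference on the target task, can be illustrated with a minimal numerical sketch. This is not the paper's exact estimator: hard-thresholding stands in for the sparsity constraint on $D$, and all sizes, noise levels, the threshold, and variable names are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
p, q = 8, 8
M0 = rng.normal(size=(p, q)) / p                 # source core L* + S*^(0)
D_star = np.zeros((p, q))
D_star[2, 3] = 0.5                               # sparse task difference D*

# Stage 1 (source): assume a good estimate of M0 is already available.
M0_hat = M0 + 0.01 * rng.normal(size=(p, q))

# Stage 2 (target): regress target responses Y = X (M0 + D*) + noise on X,
# subtract the source estimate, and hard-threshold to enforce sparsity.
n = 2000
X = rng.normal(size=(n, p))
Y = X @ (M0 + D_star) + 0.1 * rng.normal(size=(n, q))
M_ls, *_ = np.linalg.lstsq(X, Y, rcond=None)     # full least-squares fit
D_hat = M_ls - M0_hat
D_hat[np.abs(D_hat) < 0.1] = 0.0                 # surrogate for the e-sparsity constraint
```

With many target trajectories, the thresholded difference recovers the single nonzero entry of $D^*$, illustrating why the target-task error scales with the sparsity $e$ rather than the ambient dimension.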