# Transfer Q-Learning with Composite MDP Structures

Jinhang Chai 1, Elynn Chen 2, Lin Yang 3

1 Department of Operations Research and Financial Engineering, Princeton University. 2 Department of Technology, Operations, and Statistics, New York University. 3 Department of Electrical and Computer Engineering, UCLA. Correspondence to: Elynn Chen, Lin Yang.

To bridge the gap between empirical success and theoretical understanding in transfer reinforcement learning (RL), we study a principled approach with provable performance guarantees. We introduce a novel composite MDP framework in which high-dimensional transition dynamics are modeled as the sum of a low-rank component representing shared structure and a sparse component capturing task-specific variations. This relaxes the common assumption of purely low-rank transition models, allowing for more realistic scenarios where tasks share core dynamics but maintain individual variations. We introduce UCB-TQL (Upper Confidence Bound Transfer Q-Learning), designed for transfer RL scenarios where multiple tasks share core linear MDP dynamics but diverge along sparse dimensions. When applying UCB-TQL to a target task after training on a source task with sufficient trajectories, we achieve a regret bound of $\tilde{O}(\sqrt{e H^5 N})$ that scales independently of the ambient dimension. Here, $N$ is the number of trajectories in the target task, and $e$ quantifies the sparse difference between tasks. This result demonstrates a substantial improvement over single-task RL, obtained by effectively leveraging structural similarities between tasks. Our theoretical analysis provides rigorous guarantees for how UCB-TQL simultaneously exploits shared dynamics while adapting to task-specific variations.

1. Introduction

Transfer reinforcement learning (RL) has emerged as a promising solution to the fundamental challenge of sample
Proceedings of the 42nd International Conference on Machine Learning, Vancouver, Canada. PMLR 267, 2025. Copyright 2025 by the author(s).

inefficiency in RL. By leveraging knowledge from related tasks, transfer learning aims to accelerate policy learning and improve performance in new environments without requiring extensive data collection. This approach has shown empirical success across various domains, from robotics to game playing, yet the theoretical understanding of how transfer provably benefits RL remains limited.

Consider autonomous vehicle training as an illustrative example: core driving dynamics, including vehicle physics, road rules, and basic navigation, remain consistent across different driving scenarios. However, specific environments (urban vs. highway driving, varying weather conditions, different traffic patterns) introduce distinct variations to these core dynamics. This naturally suggests modeling transition dynamics as a combination of shared low-rank structure capturing common elements, plus sparse components representing scenario-specific variations.

We propose a composite MDP framework that formalizes this intuition: transition dynamics are modeled as the sum of a low-rank component representing shared structure and a sparse component capturing task-specific deviations. This structure appears in many real-world applications beyond autonomous driving: robotic manipulation with different objects, game playing across varying environments, and resource management under changing constraints all exhibit similar patterns of core shared dynamics with sparse task-specific variations.

Our approach extends existing work in several important directions. Prior transfer and multi-task RL research has primarily focused on pure low-rank MDPs (Agarwal et al., 2023; Lu et al., 2021; Cheng et al., 2022) or made direct assumptions about value or reward function similarity (Calandriello et al., 2014; Du et al., 2024; Chen et al., 2024; Chai et al., 2025).
While sparsity has been studied in the context of value function coefficients, theoretical analysis of sparse transition structures, particularly in combination with low-rank components, remains unexplored. This gap is significant because transition dynamics often capture task similarity more directly than value functions.

We begin by addressing single-task learning within this composite structure, introducing a variant of UCB-Q-learning tailored specifically for composite MDPs, which may involve a high-dimensional ambient space. In contrast to previous work, we consider the high-dimensional setting where the feature dimensions $p, q$ exceed the number of trajectories $N$, and the transition core $M^*$ is no longer a low-rank matrix. This departure from low-rank structures makes existing algorithms designed for linear MDPs inapplicable. Similarly, methods built for low-rank MDPs fail in our context due to the absence of low-rank assumptions on $M^*$. Our work provides the first theoretical guarantees for this setting, demonstrating how the algorithm successfully learns both shared and task-specific components. These results extend and complement the existing body of work on low-rank MDPs by explicitly handling structured deviations from low-rank assumptions (Du et al., 2019b; Lattimore et al., 2020). Unlike the approach in (Foster et al., 2021), which introduced a Decision-Estimation Coefficient (DEC) to characterize the statistical complexity of decision-making across various scenarios, our framework relies on distinct structural assumptions. This necessitates the development of new techniques, as discussed in detail in Section 3.3. Building on this foundation, we propose UCB-TQL (Upper Confidence Bound Transfer Q-Learning) for transfer learning in composite MDPs. UCB-TQL strategically exploits shared dynamics while efficiently adapting to task-specific variations.
Our theoretical analysis demonstrates that UCB-TQL achieves dimension-independent regret bounds that explicitly capture dependencies on both rank and sparsity, showing how structural similarities enable efficient knowledge transfer. In particular, we construct a novel confidence region (CR) for the sparse difference, thereby reducing the target sample complexity in the online learning process, as discussed in detail in Section 4.3.

Our primary contributions are as follows.

- A novel composite MDP model that combines low-rank shared structure with sparse task-specific components, while allowing high-dimensional feature spaces. This framework better captures real-world task relationships and provides a foundation for future work in multi-task and meta-learning settings.
- The first theoretical guarantees for single-task RL under the high-dimensional composite transition structure, demonstrating how algorithms can effectively learn and utilize both shared and task-specific components.
- A transfer Q-learning algorithm with provable regret bounds that explicitly characterize how structural similarities enable efficient knowledge transfer across tasks.

This work represents a significant step toward bridging the gap between the empirical success of transfer RL and its theoretical understanding by providing a rigorous analysis of how structural similarities in transition dynamics enable efficient knowledge transfer. Our results suggest new directions for developing practical algorithms that systematically leverage shared structure while accounting for task-specific variations.

1.1. Related Work

Transfer RL. (Agarwal et al., 2023) studied transfer via shared representations between source and target tasks. With generative access to source tasks, they showed that learned representations enable fast convergence to near-optimal policies in target tasks, matching performance as if ground-truth features were known.
(Cheng et al., 2022) proposed REFUEL for multitask representation learning in low-rank MDPs. They proved that learning shared representations across multiple tasks is more sample-efficient than individual task learning, provided enough tasks are available. Their analysis covers both online and offline downstream learning with shared representations. (Chen et al., 2022; 2024; Chai et al., 2025) analyzed transfer Q-learning without transition model assumptions, focusing instead on reward function similarity and transition density. These works established convergence guarantees for both backward and iterative Q-learning approaches.

Our work differs by studying transition models with low-rank-plus-sparse structures. This setting presents unique challenges beyond purely low-rank models, as we must identify and leverage an unknown low-rank space while also accounting for sparse deviations.

Single-task RL under structured MDPs. Single-task RL under structured MDPs has evolved through several key advances. Linear MDPs with known representations were initially studied by (Yang & Wang, 2020), leading to provably efficient online algorithms (Sun et al., 2019; Jin et al., 2020; Zanette et al., 2020; Neu & Pike-Burke, 2020; Cai et al., 2020; Wang et al., 2021). Low-rank MDPs extend this by requiring representation learning. Major developments include FLAMBE (Agarwal et al., 2020) for explore-then-commit transition estimation, and REP-UCB (Uehara et al., 2022) for balancing representation learning with exploration. Recent work has expanded to nonstationary settings (Cheng et al., 2023) and model-free approaches like MOFFLE (Modi et al., 2024). Related structured models include block MDPs (Du et al., 2019a; Misra et al., 2020; Zhang et al., 2022), low Bellman rank (Jiang et al., 2017), low witness rank (Sun et al., 2019), bilinear classes (Du et al., 2021), and low Bellman eluder dimension (Jin et al., 2021).
Our work introduces composite MDPs with a high-dimensional feature space and a low-rank-plus-sparse transition, extending beyond pure low-rank models. We provide the first theoretical guarantees for UCB Q-learning under this composite structure.

Multitask RL and Meta RL. Research in multitask and meta-RL has evolved through several key theoretical advances. Early work by (Calandriello et al., 2014) examined multitask RL with linear Q-functions sharing sparse support, establishing sample complexity bounds that scale with the sparsity rather than the ambient dimension. (Hu et al., 2021) extended this framework by studying weight vectors spanning low-dimensional spaces, showing that sample efficiency improves when the rank is much smaller than both the ambient dimension and the number of tasks. (Arora et al., 2020) demonstrated how representation learning reduces sample complexity in imitation learning settings, providing theoretical guarantees for learning shared structure across tasks. (Lu et al., 2022) further developed this direction by analyzing multitask RL with low Bellman error and unknown representations, establishing bounds that improve with task similarity.

Task-distribution approaches offered another perspective. (Brunskill & Li, 2013) proved sample complexity benefits when tasks are independently sampled from a finite MDP set, while (Pacchiano et al., 2022) and (Müller & Pacchiano, 2022) extended these results to meta-RL for linear mixture MDPs, showing how learned structure transfers to new tasks. In parallel, research on shared representations by (D'Eramo et al., 2020) established faster convergence rates for value iteration under common structure, and (Lu et al., 2021) proved substantial sample-efficiency gains in the low-rank MDP setting.
(Zhang et al., 2025) propose a cross-market multi-task dynamic pricing framework that achieves minimax-optimal regret bounds for both linear and nonparametric utilities under structured preference shifts.

Our composite MDP structure advances this line of work by explicitly modeling deviations from low-rank similarity through a sparse component. This framework captures more realistic scenarios where tasks share core structure but maintain individual variations, opening new theoretical directions for multitask and meta-learning approaches.

2. Problem Formulation

Episodic MDPs. We consider an episodic Markov decision process (MDP) with finite horizon, defined by a tuple $M = (S, A, P, r, \mu, H)$, where $S$ denotes the state space, $A$ the action space, $H$ the finite time horizon, $r : S \times A \to [0, 1]$ the reward function, $P$ the state transition probability, and $\mu$ the initial state distribution. Let $[H]$ denote the index set $\{1, 2, \ldots, H\}$. A policy $\pi : S \times [H] \to A$ maps each state-stage pair to an action that the agent takes in the episode. For each stage $h \in [H]$, the value function $V^\pi_h : S \to \mathbb{R}$ evaluates the expected cumulative reward from following policy $\pi$ starting from state $s$ at time $h$:
$$V^\pi_h(s) = \mathbb{E}\Big[\textstyle\sum_{h'=h}^{H} r_{h'}(s_{h'}, \pi(s_{h'})) \,\Big|\, s_h = s\Big], \qquad V^\pi_{H+1}(s) = 0,$$
while the action-value function $Q^\pi_h : S \times A \to \mathbb{R}$ evaluates the value of taking action $a$ in state $s$ at time $h$:
$$Q^\pi_h(s, a) = r_h(s, a) + \mathbb{E}\Big[\textstyle\sum_{h'=h+1}^{H} r_{h'}(s_{h'}, \pi(s_{h'})) \,\Big|\, s_h = s,\ a_h = a\Big], \qquad Q^\pi_H(s, a) = r_H(s, a).$$
The Bellman equations for $V^\pi_h$ and $Q^\pi_h$ can be expressed as $V^\pi_h(s) = Q^\pi_h(s, \pi_h(s))$ and $Q^\pi_h(s, a) = r_h(s, a) + [P V^\pi_{h+1}](s, a)$, where $[P V_{h+1}](s, a) := \sum_{s'} P(s'|s, a) V_{h+1}(s')$.1 The Bellman optimality equations for the optimal value function and action-value function are as follows:
$$V^*_h(s) = \max_{a \in A} \big\{ r_h(s, a) + [P V^*_{h+1}](s, a) \big\}, \qquad Q^*_h(s, a) = r_h(s, a) + [P V^*_{h+1}](s, a),$$
with $V^*_{H+1}(s) = 0$ and $Q^*_H(s, a) = r_H(s, a)$.
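The backward recursion defined by the Bellman optimality equations can be sketched in the tabular case. This is a minimal illustration only; the helper name and the stage-independent reward are assumptions of the sketch, not part of the paper's setup.

```python
import numpy as np

def bellman_optimality_backup(P, r, H):
    """Backward induction for a finite-horizon tabular MDP.

    P : (S, A, S) array with P[s, a, s'] the transition probability.
    r : (S, A) array of rewards (assumed stage-independent here).
    Returns Q of shape (H, S, A) and V of shape (H + 1, S).
    """
    S, A, _ = P.shape
    V = np.zeros((H + 1, S))          # V*_{H+1} = 0
    Q = np.zeros((H, S, A))
    for h in range(H - 1, -1, -1):    # stages H-1, ..., 0 (0-indexed)
        Q[h] = r + P @ V[h + 1]       # Q*_h(s,a) = r(s,a) + [P V*_{h+1}](s,a)
        V[h] = Q[h].max(axis=1)       # V*_h(s) = max_a Q*_h(s,a)
    return Q, V
```

For instance, in a two-state MDP where action 0 always moves to state 0 and only the pair (state 0, action 0) is rewarded, the optimal value of state 0 over a horizon of 3 is 3.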
The cumulative regret quantifies the performance discrepancy of an agent over episodes. Given an initial state $s_0 \sim \mu$, the regret in the $n$th episode is the value difference between the optimal policy, $V^*(s_0)$, and the agent's chosen policy, $V^{\pi_n}(s_0)$, which is based on its experience up to the beginning of the $n$th episode and applied throughout it. Accumulated over $N$ episodes, it is defined as
$$\mathrm{Regret}(N) = \sum_{n=1}^{N} \mathbb{E}_\mu\big[V^*(s_0) - V^{\pi_n}(s_0)\big].$$
The agent aims to learn a sequence of policies $(\pi_1, \ldots, \pi_N)$ that minimizes the cumulative regret. If the reward function has a linear feature representation, any additional regret from an unknown reward is a lower-order term and does not affect the overall magnitude of the regret. For clarity of presentation, we assume the agent knows the reward function and focus primarily on estimating the transition probability.

Composite MDPs. Let $\phi(\cdot) \in \mathbb{R}^p$ and $\psi(\cdot) \in \mathbb{R}^q$ be feature functions, where $p$ and $q$ can be large. Consider probability transitions $P(s'|s, a)$ that can be fully embedded in the feature space via a core matrix $M^*$:
$$P(s'|s, a) = \phi(s, a)^\top M^* \psi(s').$$
Since the feature dimensions $p$ and $q$ can be large, we need not know the exact feature functions: we can include many candidate features to span the space. What matters is learning the structure of $M^*$ from data.

1 Here $P$ is a function operator mapping a function $S \to \mathbb{R}$ to a function $S \times A \to \mathbb{R}$. In the tabular case, one can think of it as a matrix of dimension $|S||A| \times |S|$.

To capture how transition dynamics combine shared core elements with scenario-specific variations, we impose the following structured assumption on the transition matrix.

Definition 2.1 (Composite MDPs). A probability transition model $P : S \times A \to \Delta(S)$ can be fully embedded in the feature space characterized by two given feature functions $\phi(\cdot) \in \mathbb{R}^p$ and $\psi(\cdot) \in \mathbb{R}^q$, where both $p$ and $q$ can be large.
The core matrix of the transition model decomposes as
$$P(s'|s, a) = \phi(s, a)^\top (L^* + S^*)\, \psi(s'),$$
where $L^*$ is a low-rank incoherent matrix and $S^*$ is a sparse matrix.

Remark 2.2 (Distinction from linear and low-rank models). The composite MDP model generalizes both linear MDPs and low-rank MDPs. Linear MDPs assume a fixed linear transition structure with known feature maps, while low-rank MDPs drop the need for known features but restrict the transition matrix to be entirely low-rank under some feature representation. In contrast, our composite MDP assumes an unknown feature-based factorization for a low-rank core, as in the low-rank MDP, but augments it with an additional sparse component. This hybrid structure captures more complex transition dynamics and task-specific deviations than either linear or low-rank models alone, marking a clear departure from those classical assumptions.

Remark 2.3 (Distinction from classical low-rank settings). Our framework explicitly tackles high-dimensional feature spaces, in contrast to the classical low-rank MDP literature, which typically assumes moderate dimensionality (Yang & Wang, 2020; Agarwal et al., 2020). We consider regimes where the feature dimensions $p$ and $q$ are significantly larger than the number of trajectories ($p, q \gg N$) and, crucially, the shared transition core $M^*$ is not constrained to be low-rank. This departure renders existing algorithms for linear or low-rank MDPs inapplicable, since those methods rely on a strictly low-rank structure or low feature complexity. The high-dimensional composite setting therefore demands new estimation techniques that can leverage the mixed low-rank and sparse structure to achieve efficient learning.

Remark 2.4 (Relation to prior models). The composite MDP shares a structural philosophy with certain prior models, notably the linear mixture MDP frameworks studied by (Yang & Wang, 2019) and (Ayoub et al., 2020), in that transitions are factorized via feature maps.
Our model can be written in the form of (Yang & Wang, 2019) as $P(s'|s, a) = \phi(s, a)^\top \alpha(s')$ with $\alpha(s') := (L^* + S^*)\psi(s')$; this reduces to (Yang & Wang, 2019) when $S^* = 0$. Similarly, our model matches the linear factored MDP in (Ayoub et al., 2020) via a Kronecker formulation: $P(s'|s, a) = (\phi(s, a) \otimes \psi(s'))^\top \mathrm{vec}(L^* + S^*)$. While structurally related, a key difference lies in estimation. Our setting allows $p, q \gg N$, and our estimator achieves minimax-optimal error rates that do not scale with $d = \max(p, q)$. In contrast, (Yang & Wang, 2019; Ayoub et al., 2020) assume low-dimensional or identifiable parameter spaces and are not suitable for high-dimensional regimes. In addition, our structured assumption is especially crucial in transfer RL, enabling effective knowledge sharing via $L^*$ while capturing task-specific deviations via $S^*$.

3. Single-Task UCB-Q-Learning under High-Dimensional Composite MDPs

This section introduces UCB-Q-learning for composite MDPs with a high-dimensional feature space. Specifically, we consider the setting where the feature dimensions $p, q \gg N$ and the transition core $M^*$ is no longer a low-rank matrix. As a result, existing algorithms designed for linear MDPs are not applicable. Likewise, methods tailored for low-rank MDPs fail in our setting due to the absence of a low-rank structure in $M^*$. To address the challenges arising from our relaxed dimensionality constraints and the more complex MDP structure, novel algorithmic approaches are required.

For any tuple $(s_{i,h}, a_{i,h})$ from episode $i$ and stage $h$, we define $\phi_{i,h} = \phi(s_{i,h}, a_{i,h})$, $\psi_{i,h} = \psi(s_{i,h})$, and $K_\psi := \sum_{s' \in S} \psi(s')\psi(s')^\top$. Our estimator is based on the following population-level equation at each stage $h$:
$$\mathbb{E}\big[\psi_{i,h}^\top K_\psi^{-1} \,\big|\, s_{i,h}, a_{i,h}\big] = \sum_{s'} P(s'|s_{i,h}, a_{i,h})\,\psi(s')^\top K_\psi^{-1} = \sum_{s'} \phi_{i,h}^\top (L^* + S^*)\,\psi(s')\psi(s')^\top K_\psi^{-1} = \phi_{i,h}^\top (L^* + S^*). \quad (1)$$
This motivates us to use the sample-level counterpart of (1) to estimate $L^*$ and $S^*$ in a composite MDP. However, both $L^*$ and $S^*$ are unknown.
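As a small numerical sanity check of identity (1), one can verify that weighting next-state features $\psi(s')^\top K_\psi^{-1}$ by the transition row recovers $\phi^\top(L^* + S^*)$. In this sketch the sizes, the single sparse entry, and the one-hot next-state features (so that $K_\psi = I$) are all illustrative assumptions; the "transition row" is left unnormalized, which is harmless because the identity is linear in $P$.

```python
import numpy as np

rng = np.random.default_rng(0)
p, q, r = 6, 5, 2                       # illustrative feature dims and rank

# Low-rank component L* = U V^T (rescaled) plus a single-entry sparse S*
U, V = rng.normal(size=(p, r)), rng.normal(size=(q, r))
L_star = U @ V.T / np.sqrt(p * q)
S_star = np.zeros((p, q))
S_star[0, 0] = 0.3
M_star = L_star + S_star

# One-hot next-state features: Psi = I_q, hence K_psi = sum_s' psi psi^T = I_q
Psi = np.eye(q)
K_psi = Psi.T @ Psi

phi = rng.normal(size=p)                # feature of a fixed (s, a) pair
probs = phi @ M_star @ Psi.T            # "P(s'|s,a)" row (unnormalized illustration)

# Identity (1): sum_s' P(s'|s,a) psi(s')^T K^{-1} = phi^T (L* + S*)
lhs = sum(probs[j] * Psi[j] @ np.linalg.inv(K_psi) for j in range(q))
rhs = phi @ M_star
assert np.allclose(lhs, rhs)
```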
To recover the low-rank and sparse components, additional assumptions are required to ensure that the low-rank part can be separated from the sparse component. Below, we elaborate on the incoherence assumption and the sufficient-sparsity condition.

Assumption 3.1. Let $L^* = U^* \Sigma^* V^{*\top}$ be the singular value decomposition (SVD) of $L^*$. Let $\mu$ be the incoherence parameter (a constant $> 1$) and $r$ the rank of $L^*$. We assume that:
(i) (Incoherence.) $\|U^*\|_{2,\infty} \le \sqrt{\mu r / p}$ and $\|V^*\|_{2,\infty} \le \sqrt{\mu r / q}$, where the 2-to-infinity norm $\|\cdot\|_{2,\infty}$ denotes the maximum $\ell_2$-norm of the rows of a matrix.
(ii) (Sufficient sparsity.) The matrix $S^*$ contains at most $s$ nonzero entries, where $s \le \bar{s} := \max\{p, q\}/(4 C_S \mu r^3)$, for some constant $C_S$.

Remark 3.2 (Incoherence condition). The incoherence condition ensures that the singular vectors of a low-rank matrix are not overly concentrated in any single direction or entry, a property that is crucial for matrix completion (Candès & Recht, 2012). In our setting, it also facilitates the separation of the sparse component from the low-rank matrix. When $r$ and $\mu$ are treated as constants, the maximum permissible sparsity level scales linearly with $p$. Moreover, as shown in (Candès & Tao, 2010), the incoherence condition holds for a broad class of random matrices.

We consider the online learning setting and propose to estimate $L^*$ and $S^*$ in the composite MDP by optimizing the following hard-constrained least-squares objective for each episode $n \in [N]$, with collected tuples $(s_{i,h}, a_{i,h})$ from previous episodes $i \in [n]$ and stages $h \in [H]$:
$$(\hat{L}_n, \hat{S}_n) \in \arg\min_{L, S \in \mathbb{R}^{p \times q}} \sum_{i \in [n],\, h \in [H]} \big\| \psi_{i,h}^\top K_\psi^{-1} - \phi_{i,h}^\top (L + S) \big\|_2^2 \quad \text{s.t. } L = U \Sigma V^\top,\ \|U\|_{2,\infty} \le \sqrt{\mu r / p},\ \|V\|_{2,\infty} \le \sqrt{\mu r / q}. \quad (2)$$

Remark 3.3 (Computational Efficiency of $K_\psi$). When $|S|$ is large or infinite, we can compute $K_\psi$ using a Monte Carlo approximation: $\hat{K}_\psi = \frac{1}{m}\sum_{i=1}^{m} \psi(s_i)\psi(s_i)^\top$ with $s_i \sim \mathrm{Unif}(S)$. This is a standard approach in randomized numerical linear algebra (Drineas & Mahoney, 2016) and is computed only once before online learning.
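The Monte Carlo approximation of Remark 3.3 can be sketched as follows; the helper name `monte_carlo_K_psi` and the uniform state sampler are illustrative assumptions of this sketch.

```python
import numpy as np

def monte_carlo_K_psi(psi, state_sampler, m, rng):
    """Monte Carlo approximation of E_{s' ~ Unif(S)}[psi(s') psi(s')^T],
    i.e. K_psi up to the |S| scaling, as in Remark 3.3:
        K_hat = (1/m) * sum_i psi(s_i) psi(s_i)^T,  s_i ~ Unif(S).
    """
    q = len(psi(state_sampler(rng)))
    K_hat = np.zeros((q, q))
    for _ in range(m):
        v = psi(state_sampler(rng))
        K_hat += np.outer(v, v)
    return K_hat / m
```

For one-hot features over four states, the exact average is $I_4 / 4$, and the Monte Carlo estimate converges to it as $m$ grows.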
Our method is thus comparable in efficiency to (Yang & Wang, 2019), which stores empirical covariances, but we use $K_\psi$ to build confidence regions specific to our model.

3.1. UCB-Q-Learning for High-Dimensional Composite MDPs

Since the transition dynamics $P$ are typically unknown, we must leverage observed data to approximate the underlying model parameters. To balance the exploration-exploitation trade-off, we adopt the optimism-in-the-face-of-uncertainty principle and employ an Upper Confidence Bound (UCB)-based algorithm. We begin by constructing the confidence region
$$B_n = \big\{ (L, S) \,\big|\, \|L - \hat{L}_n\|_F^2 + \|S - \hat{S}_n\|_F^2 \le \beta_n \big\}, \quad (3)$$
where $\hat{L}_n$ and $\hat{S}_n$ are estimated by (2),
$$\beta_n = \frac{C_\beta H \log(d N H)}{n} \big[ r\, (C_\phi C'_\psi)^2 + s\, C_{\phi\psi}^2 \big], \quad (4)$$
$d = \max\{p, q\}$, the quantities $C_\phi, C_\psi, C'_\psi, C_{\phi\psi}$ are positive parameters defined in the regularity Assumption 3.4, and $C_\beta$ is a universal constant. The optimistic value functions are given by
$$Q_{n,h}(s, a) = r(s, a) + \max_{(L,S) \in B_n} \phi(s, a)^\top (L + S)\, \Psi^\top V_{n,h+1}, \qquad Q_{n,H+1}(s, a) = 0, \quad (5)$$
where $V_{n,h}(s) = \Pi_{[0,H]}[\max_a Q_{n,h}(s, a)]$, with $\Pi_{[0,H]}$ truncating values to $[0, H]$. Here, $\Psi \in \mathbb{R}^{|S| \times q}$ is the feature matrix whose rows are the $q$-dimensional feature vectors of the states in $S$.

Algorithm 1 UCB-Q-Learning for HD Composite MDPs
Input: total number of episodes $N$, feature functions $\phi \in \mathbb{R}^p$, $\psi \in \mathbb{R}^q$.
for episode $n = 1, 2, \ldots, N$ do
  Construct the confidence region $B_n$ in (3).
  Calculate $Q_{n,h}(s, a)$ in (5).
  for stage $h = 1, 2, \ldots, H$ do
    Take action $a_{n,h} = \arg\max_{a \in A} Q_{n,h}(s_{n,h}, a)$.
    Observe next state $s_{n,h+1}$.
  end for
  Learn transition core estimators $\hat{L}_n, \hat{S}_n$ using (2).
end for

The algorithm is summarized in Algorithm 1.

3.2. Regret Analysis for UCB-Q-Learning under High-Dimensional Composite MDPs

For the regret analysis, we impose certain regularity conditions on the features, as outlined below.

Assumption 3.4. Let $\Psi$ be the matrix with rows $\psi(s)^\top$.
Let $C_\phi, C'_\phi, C_\psi, C'_\psi, C_{\phi\psi}$ be positive parameters such that:
(i) for all $(s, a)$: $\|\phi(s, a)\|_2 \le C_\phi$ and $\|\phi(s, a)\|_\infty \le C'_\phi$;
(ii) $\|\Psi\|_{2,\infty} \le C_\psi$;
(iii) for all $s'$: $\|\psi(s')^\top K_\psi^{-1}\|_2 \le C'_\psi$;
(iv) for all $(s, a, s')$: $\|\phi(s, a)\,\psi(s')^\top K_\psi^{-1}\|_{\max} \le C_{\phi\psi}$.

Lemma 3.5 (Transition Estimation Error). For composite MDPs in Definition 2.1, under Assumptions 3.1 and 3.4, the estimator obtained by solving program (2) at the end of the $n$th episode satisfies, with probability at least $1 - 1/(n^2 H)$,
$$\|\hat{L}_n - L^*\|_F^2 + \|\hat{S}_n - S^*\|_F^2 \le \beta_n,$$
where $\beta_n$ is defined in (4).

Remark 3.6. Our method handles the high-dimensional setting by explicitly exploiting the low-rank-plus-sparse structure of the transition core matrix $M^* = L^* + S^*$, where $L^*$ has low rank $r \ll d$ and $S^*$ has sparsity $s \ll d$. This structure enables consistent estimation in regimes where $p, q \gg N$, and it is key to our theoretical guarantees. This error bound is minimax optimal with respect to $n$.

Theorem 3.7 (Single-Task Regret Upper Bound). For composite MDPs in Definition 2.1, under Assumptions 3.1 and 3.4, let $\mathrm{Regret}(NH)$ be the accumulated regret over a total of $N$ episodes of the UCB-Q-Learning in Algorithm 1. We have that
$$\mathrm{Regret}(NH) \lesssim C_\phi C_\psi H^2 \sum_{n=1}^{N} \sqrt{\beta_n} + 1 \lesssim C_{\mathrm{reg}}\, H^{5/2} \sqrt{N},$$
where $d = \max\{p, q\}$ and $C_{\mathrm{reg}} := C_\phi C_\psi \sqrt{C_\beta \big[ r\,(C_\phi C'_\psi)^2 + s\, C_{\phi\psi}^2 \big] \log(d N H)}$.

Remark 3.8. This regret bound achieves optimal scaling with respect to both the number of trajectories $N$ and the ambient dimension $d$, matching previous results in reinforcement learning (Yang & Wang, 2020; Jin et al., 2020). In Section 4, we demonstrate that transfer learning can substantially reduce both the dependence on the ambient dimension $d$ and the scaling with $N$ by effectively utilizing additional trajectories from a source task.

3.3. Challenge and Proof Sketch under the Composite Structure

By the optimality condition of (2), the estimation error decomposes into a martingale-noise term and a structural term that must be controlled jointly for the low-rank and sparse parts; the full argument is given in Appendix A.

Notation. We write $a_n \lesssim b_n$ if there exists a constant $C > 0$ such that $a_n \le C b_n$; $\tilde{O}(\cdot)$ is similarly defined, neglecting logarithmic factors. Constants $c, C, c_0$ may vary from line to line.

A.
Regret Analysis of the Single-Task UCB-Q-Learning with Composite MDP Structures

We present below the proof of Theorem 3.7. The proof of Lemma 3.5 emerges as an intermediate step along the way.

Proof of Theorem 3.7. We first sketch the proof as follows. Define the good event $E_n$ that the ground-truth transition core lies in the confidence region before episode $n$, i.e., $(L^*, S^*) \in B_{n'}$ for all $n' \le n - 1$, and denote $\mathcal{E}_n = \mathbb{1}_{E_n}$. We first argue under $E_n$ and later use concentration to prove that $E_n$ holds with high probability.

1. Under $E_n$, prove $Q_{n,h} \ge Q^*_h$ using induction.
2. Bound $Q_{n,h}(s_{n,h}, a_{n,h}) - \big[ r(s_{n,h}, a_{n,h}) + P(\cdot|s_{n,h}, a_{n,h})^\top V_{n,h+1} \big]$.
3. Bound the total regret by the one-step errors derived in Step 2.

We elaborate each step in the sequel.

A.1. Upper confidence bound

Lemma A.1. Given any state-action pair $(s, a) \in S \times A$, for each episode $n$ and decision step $h$, we have $Q_{n,h}(s, a) \ge Q^*_h(s, a)$.

Proof. We use induction. At $h = H$, we have $Q_{n,H}(s, a) = Q^*_H(s, a) = r(s, a)$. Assuming the claim holds at stage $h + 1$, so that $V_{n,h+1}(s) \ge V^*_{h+1}(s)$, it extends to stage $h$:
$$Q_{n,h}(s, a) = r(s, a) + \max_{(L,S) \in B_n} \phi(s, a)^\top (L + S)\, \Psi^\top V_{n,h+1} \ \ge\ r(s, a) + \phi(s, a)^\top (L^* + S^*)\, \Psi^\top V_{n,h+1} \ \ge\ r(s, a) + [P V^*_{h+1}](s, a) = Q^*_h(s, a).$$

A.2. One-step bound

Lemma A.2. For any $h \in [H]$ and $n \in [N]$, under $E_n$ we have
$$Q_{n,h}(s_{n,h}, a_{n,h}) - \big( r(s_{n,h}, a_{n,h}) + [P_h V_{n,h+1}](s_{n,h}, a_{n,h}) \big) \ \lesssim\ C_\phi C_\psi H \sqrt{\beta_n}.$$

Proof. Let $(\tilde{L}, \tilde{S}) = \arg\max_{(L,S) \in B_n} \phi(s, a)^\top (L + S)\, \Psi^\top V_{n,h+1}$. It holds that
$$Q_{n,h}(s_{n,h}, a_{n,h}) - \big( r(s_{n,h}, a_{n,h}) + [P_h V_{n,h+1}](s_{n,h}, a_{n,h}) \big) = \phi_{n,h}^\top \big[ \tilde{L} + \tilde{S} - M^* \big] \Psi^\top V_{n,h+1} = \phi_{n,h}^\top \big[ (\tilde{L} - L^*) + (\tilde{S} - S^*) \big] \Psi^\top V_{n,h+1}.$$
Applying Hölder's inequality, the triangle inequality, and the Cauchy-Schwarz inequality, we deduce
$$\begin{aligned}
&Q_{n,h}(s_{n,h}, a_{n,h}) - \big( r(s_{n,h}, a_{n,h}) + [P_h V_{n,h+1}](s_{n,h}, a_{n,h}) \big) \\
&\quad \le \|\phi_{n,h}^\top (\tilde{L} - L^*)\|_2\, \|\Psi^\top V_{n,h+1}\|_2 + \|\phi_{n,h}^\top (\tilde{S} - S^*)\|_2\, \|\Psi^\top V_{n,h+1}\|_2 \\
&\quad \le H C_\psi\, \|\phi_{n,h}^\top (\tilde{L} - L^*)\|_2 + H C_\psi\, \|\phi_{n,h}^\top (\tilde{S} - S^*)\|_2 \\
&\quad \le H C_\psi \|\phi_{n,h}\|_2\, \|\tilde{L} - L^*\|_F + H C_\psi \|\phi_{n,h}\|_2\, \|\tilde{S} - S^*\|_F \\
&\quad \le C_\phi C_\psi H \big( \|\tilde{L} - L^*\|_F + \|\tilde{S} - S^*\|_F \big) \ \lesssim\ C_\phi C_\psi H \sqrt{\beta_n},
\end{aligned}$$
where the last step uses that both $(\tilde{L}, \tilde{S})$ and $(L^*, S^*)$ lie in $B_n$.

A.3.
Regret decomposition

We bound the regret by the sum of one-step errors. To set the stage, let $F_{n,h}$ be the $\sigma$-field generated by all the random variables up until episode $n$, step $h$, essentially fixing the sequence $s_{1,1}, a_{1,1}, s_{1,2}, a_{1,2}, \ldots, s_{n,h}, a_{n,h}$. To proceed, let
$$\delta_{n,h} := (V_{n,h} - V^{\pi_n}_h)(s_{n,h}), \qquad \gamma_{n,h} := Q_{n,h}(s_{n,h}, a_{n,h}) - \big( r(s_{n,h}, a_{n,h}) + [P_h V_{n,h+1}](s_{n,h}, a_{n,h}) \big).$$
Then
$$\mathrm{Regret}(NH) = \mathbb{E} \sum_{n=1}^{N} \big[ V^*(s_0) - V^{\pi_n}(s_0) \big] \le \mathbb{E} \sum_{n=1}^{N} (V_{n,1} - V^{\pi_n}_1)(s_0) = \sum_{n=1}^{N} \mathbb{E}(\delta_{n,1}).$$
We have $\mathbb{E}(\delta_{n,1}) = \mathbb{E}\big(\delta_{n,1}\mathcal{E}_n + (1 - \mathcal{E}_n)\delta_{n,1}\big) \le \mathbb{E}(\delta_{n,1}\mathcal{E}_n) + H\,\mathbb{P}[\mathcal{E}_n = 0]$, and
$$\begin{aligned}
\mathbb{E}(\delta_{n,1}\mathcal{E}_n \mid F_{n,1}) &= (V_{n,1} - V^{\pi_n}_1)(s_{n,1})\,\mathcal{E}_n \le Q_{n,1}(s_{n,1}, a_{n,1}) - V^{\pi_n}_1(s_{n,1}) \\
&= \gamma_{n,1} + \big( r(s_{n,1}, a_{n,1}) + [P V_{n,2}](s_{n,1}, a_{n,1}) \big) - V^{\pi_n}_1(s_{n,1}) \\
&\le \gamma_{n,1} + \mathbb{E}\big( (V_{n,2} - V^{\pi_n}_2)(s_{n,2})\,\mathcal{E}_n \mid F_{n,1} \big) \le \cdots \le \sum_{h=1}^{H} \mathbb{E}(\gamma_{n,h} \mid F_{n,1}).
\end{aligned}$$
Therefore, we establish the regret bound with high probability that
$$\mathrm{Regret}(NH) \lesssim C_\phi C_\psi H^2 \sum_{n=1}^{N} \sqrt{\beta_n} + NH\,\mathbb{P}[\mathcal{E}_N = 0], \quad (19)$$
where we use $E_N \subseteq E_n$ for $n \le N$.

A.4. Confidence region

In this subsection, we validate the confidence region; recall the CR is defined in (3). Denote
$$X_n = \begin{pmatrix} \phi_{1,1}^\top \\ \vdots \\ \phi_{1,H}^\top \\ \vdots \\ \phi_{n-1,H}^\top \end{pmatrix}, \qquad Y_n = \begin{pmatrix} \psi_{1,1}^\top K_\psi^{-1} \\ \vdots \\ \psi_{1,H}^\top K_\psi^{-1} \\ \vdots \\ \psi_{n-1,H}^\top K_\psi^{-1} \end{pmatrix};$$
we have the observational model $Y_n = X_n(L^* + S^*) + W_n$, where $W_n = Y_n - X_n(L^* + S^*)$. In view of (Chai & Fan, 2024), we need the observation model to satisfy the restricted strong convexity condition in order for the low-rank part and the sparse part to be separated. We will later give specific cases in which this inequality holds.

In the sequel, we bound $\|X_n^\top W_n\|_2$ and $\|X_n^\top W_n\|_{\max}$. In fact, we can express
$$X_n^\top W_n = \sum_{i,h} \phi_{i,h} \big( \psi_{i,h}^\top K_\psi^{-1} - \phi_{i,h}^\top M^* \big).$$
Note that $\mathbb{E}\big[ \psi_{i,h}^\top K_\psi^{-1} - \phi_{i,h}^\top M^* \,\big|\, F_{i,h} \big] = 0$, so $X_n^\top W_n$ is a sum of martingale differences. Let $Z_{i,h} = \phi_{i,h}\big( \psi_{i,h}^\top K_\psi^{-1} - \phi_{i,h}^\top M^* \big)$; we have
$$\|Z_{i,h}\|_2 \le \|\phi_{i,h}\|_2 \big( \|\psi_{i,h}^\top K_\psi^{-1}\|_2 + \|\phi_{i,h}^\top M^*\|_2 \big) \le 2 C_\phi C'_\psi,$$
where we used $\|\phi_{i,h}^\top M^*\|_2 = \|\mathbb{E}[\psi_{i,h}^\top K_\psi^{-1} \mid F_{i,h}]\|_2 \le C'_\psi$. On the other hand,
$$\Big\| \sum_{i,h} \mathbb{E}[Z_{i,h}^\top Z_{i,h} \mid F_{i,h}] \Big\|_2 \le 4 n H (C_\phi C'_\psi)^2, \qquad \Big\| \sum_{i,h} \mathbb{E}[Z_{i,h} Z_{i,h}^\top \mid F_{i,h}] \Big\|_2 \le 4 n H (C_\phi C'_\psi)^2.$$
By the matrix Freedman inequality (Corollary 1.3 in (Tropp, 2011)), we have with probability at least $1 - \delta$ that
$$\|X_n^\top W_n\|_2 \lesssim C_\phi C'_\psi \sqrt{n H \log(d/\delta)}.$$
Similarly, each entry of $X_n^\top W_n$ is a sum of martingale differences, almost surely bounded by $2 C_{\phi\psi}$, and we have by the Azuma-Hoeffding inequality and a union bound over the $d^2$ entries that, with probability at least $1 - \delta$,
$$\|X_n^\top W_n\|_{\max} = \max_{i,j} \big| [X_n^\top W_n]_{ij} \big| \lesssim C_{\phi\psi} \sqrt{n H \log(d/\delta)}.$$

Confidence Region. By the optimality condition, we have $\|Y_n - X_n(\hat{L}_n + \hat{S}_n)\|_F^2 \le \|Y_n - X_n(L^* + S^*)\|_F^2$. Expanding on both sides yields
$$\begin{aligned}
\|X_n(\Delta_L + \Delta_S)\|_F^2 &\le 2 \langle W_n,\, X_n(\Delta_L + \Delta_S) \rangle = 2 \langle X_n^\top W_n,\, \Delta_L + \Delta_S \rangle \\
&\le 2 \|X_n^\top W_n\|_2\, \|\Delta_L\|_* + 2 \|X_n^\top W_n\|_{\max}\, \|\Delta_S\|_1 \\
&\le 2\sqrt{2r}\, \|X_n^\top W_n\|_2\, \|\Delta_L\|_F + 2\sqrt{2s}\, \|X_n^\top W_n\|_{\max}\, \|\Delta_S\|_F,
\end{aligned}$$
where we denote $\Delta_L := \hat{L}_n - L^*$ and $\Delta_S := \hat{S}_n - S^*$. On the other hand, by (20) and the separation lemma in (Chai & Fan, 2024), we have that
$$\|X_n(\Delta_L + \Delta_S)\|_F^2 \ge c_1 (n - 1)\, \|\Delta_L + \Delta_S\|_F^2 \ge \frac{c_1 (n - 1)}{2} \big( \|\Delta_L\|_F^2 + \|\Delta_S\|_F^2 \big).$$
Putting these together, we have
$$\|\Delta_L\|_F^2 + \|\Delta_S\|_F^2 \le \frac{4}{c_1 (n-1)} \Big( \sqrt{2r}\, \|X_n^\top W_n\|_2\, \|\Delta_L\|_F + \sqrt{2s}\, \|X_n^\top W_n\|_{\max}\, \|\Delta_S\|_F \Big) \le \frac{4}{c_1 (n-1)} \sqrt{2\big( r \|X_n^\top W_n\|_2^2 + s \|X_n^\top W_n\|_{\max}^2 \big)} \sqrt{\|\Delta_L\|_F^2 + \|\Delta_S\|_F^2},$$
which implies
$$\|\Delta_L\|_F^2 + \|\Delta_S\|_F^2 \le \frac{32}{c_1^2 (n-1)^2} \big( r\, \|X_n^\top W_n\|_2^2 + s\, \|X_n^\top W_n\|_{\max}^2 \big).$$
Plugging in the aforementioned bounds on $\|X_n^\top W_n\|_2$ and $\|X_n^\top W_n\|_{\max}$, we deduce that $B_n$ is a valid $\delta$-confidence region if we take
$$\beta_n \asymp \frac{1}{n^2} \big( r\,(C_\phi C'_\psi)^2\, n H \log(d/\delta) + s\, C_{\phi\psi}^2\, n H \log(d/\delta) \big) = \frac{C_\beta H \log(d/\delta)}{n} \big( r\,(C_\phi C'_\psi)^2 + s\, C_{\phi\psi}^2 \big) \quad (21)$$
for some large enough $C_\beta$. In particular, we take $\delta = 1/(N^2 H)$ in the above display; then by the union bound,
$$\mathbb{P}(\mathcal{E}_N = 0) \le N \cdot \frac{1}{N^2 H} = \frac{1}{N H}.$$
Combining the definition of $\beta_n$ with (19), we obtain that
$$\mathrm{Regret}(NH) \lesssim C_\phi C_\psi H^2 \sum_{n=1}^{N} \sqrt{\beta_n} + 1 \lesssim C_{\mathrm{reg}}\, H^{5/2} \sqrt{N},$$
where $C_{\mathrm{reg}} := C_\phi C_\psi \sqrt{C_\beta \big[ r\,(C_\phi C'_\psi)^2 + s\, C_{\phi\psi}^2 \big] \log(d N H)}$. As a byproduct, we have that by the end of the $N$th episode, with probability at least $1 - 1/(N^2 H)$,
$$\|\hat{L}_N - L^*\|_F^2 + \|\hat{S}_N - S^*\|_F^2 \le \beta_N.$$

A.5. Discussion on Condition (20)

Now we provide an example where Condition (20) holds. At a high level, the $\phi_{i,h}$ may be highly correlated across different steps and episodes.
Nonetheless, note that each episode starts at an independent initial state, hence providing diversity to the linear operator $X_n$. In fact, Condition (20) holds when $\phi(s, a)$ depends mainly on $s$ and $a$ adds a perturbation effect. To be more concrete, consider the following lemma as an example.

Lemma A.3. Suppose there exists some function $\bar\phi$ such that $\|\phi(s, a) - \bar\phi(s)\|_2 \le \eta$, and $\lambda_{\min}\big(\mathbb{E}_\mu[\bar\phi(s)\bar\phi(s)^\top]\big) \ge c_{\min}$. If $\eta(2 C_\phi + \eta) \le c_{\min}/4$ and $n \ge C_e d$ for some constant $C_e$, then with probability at least $1 - 2e^{-c_e n}$, Condition (20) holds with $c_1 = c_{\min}/4$.

Proof. Denote $\mathbb{E}_\mu[\bar\phi(s)\bar\phi(s)^\top]$ by $\Sigma_\mu$. As $|\bar\phi(s)^\top v| \le \|\bar\phi(s)\|_2 \le C_\phi$ for any unit vector $v$, $\bar\phi(s)$ is sub-Gaussian with variance proxy $(C_\phi/2)^2$. By (5.25) in (Vershynin, 2010), there exist constants $C_e, c_e$ such that with probability at least $1 - 2e^{-c_e n}$,
$$\Big\| \frac{1}{n-1} \sum_{i=1}^{n-1} \bar\phi(s_{i,1})\bar\phi(s_{i,1})^\top - \Sigma_\mu \Big\|_2 \le \frac{c_{\min}}{2},$$
as long as $n \ge C_e d$. Let $\Delta = \frac{1}{n-1}\sum_{i=1}^{n-1} \bar\phi(s_{i,1})\bar\phi(s_{i,1})^\top - \frac{1}{n-1}\sum_{i=1}^{n-1} \phi_{i,1}\phi_{i,1}^\top$. We have for any unit vector $v$ that
$$|v^\top \Delta v| = \Big| \frac{1}{n-1} \sum_{i=1}^{n-1} \big( [\bar\phi(s_{i,1})^\top v]^2 - [\phi_{i,1}^\top v]^2 \big) \Big| \le \max_i \|\bar\phi(s_{i,1}) - \phi_{i,1}\|_2\, \|\bar\phi(s_{i,1}) + \phi_{i,1}\|_2 \le \eta(2 C_\phi + \eta).$$
Hence $\|\Delta\|_2 \le \eta(2 C_\phi + \eta) \le c_{\min}/4$. It follows that
$$\lambda_{\min}\Big( \frac{1}{n-1} \sum_{i=1}^{n-1} \phi_{i,1}\phi_{i,1}^\top \Big) \ge \frac{c_{\min}}{2} - \frac{c_{\min}}{4} = \frac{c_{\min}}{4}.$$
To conclude, note that
$$\frac{1}{n-1} \sum_{i=1}^{n-1} \sum_{h=1}^{H} \phi_{i,h}\phi_{i,h}^\top \succeq \frac{1}{n-1} \sum_{i=1}^{n-1} \phi_{i,1}\phi_{i,1}^\top \succeq \frac{c_{\min}}{4}\, I.$$

Remark A.4. The condition $n \ge C_e d$ requires a warm start. For $n \le C_e d$, one can use a fixed policy to generate samples. This will not affect the total regret as long as $d H$ is negligible compared to the leading term of the regret.

B. Regret Analysis for UCB-TQL under Composite MDPs

In this section, we provide the proof of Theorem 4.6. The pipeline is similar to the single-task setting. We start by constructing the confidence region. To that end, denote
$$X^{(1)}_n = \begin{pmatrix} \phi^{(1)\top}_{1,1} \\ \vdots \\ \phi^{(1)\top}_{1,H} \\ \vdots \\ \phi^{(1)\top}_{n-1,H} \end{pmatrix}, \qquad Y^{(1)}_n = \begin{pmatrix} \psi^{(1)\top}_{1,1} K_\psi^{-1} \\ \vdots \\ \psi^{(1)\top}_{1,H} K_\psi^{-1} \\ \vdots \\ \psi^{(1)\top}_{n-1,H} K_\psi^{-1} \end{pmatrix};$$
we have the observational model $Y^{(1)}_n = X^{(1)}_n (L^* + S^{*(0)} + D^*) + W_n$, where $W_n := Y^{(1)}_n - X^{(1)}_n (L^* + S^{*(0)} + D^*)$. By the optimality condition, we have
$$\big\| Y^{(1)}_n - X^{(1)}_n (\hat{L} + \hat{S}^{(0)} + \hat{D}_n) \big\|_F^2 \le \big\| Y^{(1)}_n - X^{(1)}_n (L^* + S^{*(0)} + D^*) \big\|_F^2.$$
After some calculation, we obtain
$$\begin{aligned}
\|X^{(1)}_n (\hat{D}_n - D^*)\|_F^2 &\le 2 \big\langle Y^{(1)}_n - X^{(1)}_n (\hat{L} + \hat{S}^{(0)} + D^*),\ X^{(1)}_n (\hat{D}_n - D^*) \big\rangle \\
&= 2 \big\langle X^{(1)\top}_n W_n + X^{(1)\top}_n X^{(1)}_n \big( L^* - \hat{L} + S^{*(0)} - \hat{S}^{(0)} \big),\ \hat{D}_n - D^* \big\rangle \\
&\le 2 \|X^{(1)\top}_n W_n\|_{\max}\, \|\hat{D}_n - D^*\|_1 + 2 \big\| X^{(1)\top}_n X^{(1)}_n \big( L^* - \hat{L} + S^{*(0)} - \hat{S}^{(0)} \big) \big\|_F\, \|\hat{D}_n - D^*\|_F \\
&\le 2\sqrt{2e}\, \|X^{(1)\top}_n W_n\|_{\max}\, \|\hat{D}_n - D^*\|_F + 2 \big\| X^{(1)\top}_n X^{(1)}_n \big( L^* - \hat{L} + S^{*(0)} - \hat{S}^{(0)} \big) \big\|_F\, \|\hat{D}_n - D^*\|_F.
\end{aligned}$$
We have, similarly as before, with probability at least $1 - \delta$, $\|X^{(1)\top}_n W_n\|_{\max} \lesssim C_{\phi\psi}\sqrt{nH \log(d/\delta)}$ and $\|X^{(1)\top}_n X^{(1)}_n\|_2 \lesssim n - 1$; it then holds that
$$\|\hat{D}_n - D^*\|_F^2 \lesssim \frac{e\, C_{\phi\psi}^2\, H \log d}{n} + \|L^* - \hat{L}\|_F^2 + \|S^{*(0)} - \hat{S}^{(0)}\|_F^2.$$
Suppose at the first stage we established
$$\|L^* - \hat{L}\|_F^2 + \|S^{*(0)} - \hat{S}^{(0)}\|_F^2 \le \beta_{N_0}$$
with probability at least $1 - 1/(2N^2 H)$, where $\beta_{N_0} = \frac{C_\beta H \log(d/\delta)}{N_0} \big[ r\,(C_\phi C'_\psi)^2 + s\, C_{\phi\psi}^2 \big]$, as in (21).

Naive CR. Recall that we can construct a naive confidence region as
$$B_n = \big\{ (L, S) \,\big|\, \|L - \hat{L}_n\|_F^2 + \|S - \hat{S}_n\|_F^2 \le \beta^{(1)}_n \big\},$$
where $\beta^{(1)}_n := \beta_{N_0} + \frac{e\, C_{\phi\psi}^2\, H \log(d N H)}{n}$. When using this confidence region to construct optimistic value functions, we can plug the value of $\beta^{(1)}_n$ into (19). It follows that
$$\mathrm{Regret}(NH) \lesssim C_\phi C_\psi H^2 \Big( N \sqrt{\beta_{N_0}} + \sqrt{e\, C_{\phi\psi}^2\, N H \log(d N H)} \Big).$$

Remark B.1. When $N_0 \gg N$, this bound is dominated by the second term, which depends on $C_\phi$.

Tight CR. As illustrated in the proof sketch, we can achieve a better rate by constructing a more fine-grained confidence region:
$$\tilde{B}_n = \Big\{ (L, S, D) \,\Big|\, \|L - \hat{L}^{(0)}\|_F^2 + \|S - D - \hat{S}^{(0)}\|_F^2 \le \beta_{N_0},\ \|D - \hat{D}_n\|_F^2 \le \beta^{(1)}_n,\ \|D\|_0 \le e \Big\}.$$
It is straightforward to show that $(L^*, S^{*(0)}, D^*) \in \tilde{B}_n$. We then carry out a more refined one-step analysis, similar in vein to Section 3.2.2.
In particular, let $(\tilde{L}, \tilde{S}, \tilde{D}) = \arg\max_{(L,S,D) \in \tilde{B}_n} \phi(s, a)^\top (L + S)\, \Psi^\top V_{n,h+1}$; it holds that
$$\begin{aligned}
&Q_{n,h}(s_{n,h}, a_{n,h}) - \big( r(s_{n,h}, a_{n,h}) + [P_h V_{n,h+1}](s_{n,h}, a_{n,h}) \big) \\
&\quad \le \|\phi_{n,h}^\top (\tilde{L} - L^*)\|_2\, \|\Psi^\top V_{n,h+1}\|_2 + \|\phi_{n,h}^\top (\tilde{S} - \tilde{D} - S^* + D^*)\|_2\, \|\Psi^\top V_{n,h+1}\|_2 + \big| \phi_{n,h}^\top (\tilde{D} - D^*)\, \Psi^\top V_{n,h+1} \big| \\
&\quad \le H C_\psi\, \|\phi_{n,h}^\top (\tilde{L} - L^*)\|_2 + H C_\psi\, \|\phi_{n,h}^\top (\tilde{S} - \tilde{D} - S^* + D^*)\|_2 + \sqrt{2e}\, H C'_\phi C_\psi\, \|\tilde{D} - D^*\|_F \\
&\quad \le C_\phi C_\psi H \big( \|\tilde{L} - L^*\|_F + \|\tilde{S} - \tilde{D} - S^* + D^*\|_F \big) + \sqrt{2e}\, H C'_\phi C_\psi\, \|\tilde{D} - D^*\|_F \\
&\quad \lesssim C_\phi C_\psi H \sqrt{\beta_{N_0}} + C'_\phi C_\psi H \sqrt{\beta^{(1)}_n\, e}, \quad (22)
\end{aligned}$$
where the sparsity constraint on $D$ is used to bound the third term in the second line. Combined with the one-step error in the regret decomposition (19), we obtain that
$$\mathrm{Regret}(NH) \lesssim C_\phi C_\psi H^2 N \sqrt{\beta_{N_0}} + C'_\phi C_\psi H^2 \sum_{n=1}^{N} \sqrt{\beta^{(1)}_n\, e} \ \lesssim\ \big( C_\phi + C'_\phi \sqrt{e} \big) C_\psi H^2 N \sqrt{\beta_{N_0}} + C'_\phi C_\psi H^2 \sqrt{e\, C_{\phi\psi}^2\, N H \log(d N H)}.$$
Plugging in the definition of $\beta_{N_0}$ completes the proof.

Remark B.2. When $N_0 \gg N$, this bound is dominated by the second term, which does not depend on $C_\phi$ but rather on $C'_\phi$. It is tighter than the rate of the naive-CR approach, since $C_\phi \ge C'_\phi$.
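The two-stage pipeline behind this analysis, estimate the shared core on the source task, then fit only the sparse difference on the target task, can be illustrated with a minimal numerical sketch. This is not the paper's exact estimator: hard-thresholding stands in for the sparsity constraint on $D$, and all sizes, noise levels, the threshold, and variable names are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
p, q = 8, 8
M0 = rng.normal(size=(p, q)) / p                 # source core L* + S*^(0)
D_star = np.zeros((p, q))
D_star[2, 3] = 0.5                               # sparse task difference D*

# Stage 1 (source): assume a good estimate of M0 is already available.
M0_hat = M0 + 0.01 * rng.normal(size=(p, q))

# Stage 2 (target): regress target responses Y = X (M0 + D*) + noise on X,
# subtract the source estimate, and hard-threshold to enforce sparsity.
n = 2000
X = rng.normal(size=(n, p))
Y = X @ (M0 + D_star) + 0.1 * rng.normal(size=(n, q))
M_ls, *_ = np.linalg.lstsq(X, Y, rcond=None)     # full least-squares fit
D_hat = M_ls - M0_hat
D_hat[np.abs(D_hat) < 0.1] = 0.0                 # surrogate for the e-sparsity constraint
```

With many target trajectories, the thresholded difference recovers the single nonzero entry of $D^*$, illustrating why the target-task error scales with the sparsity $e$ rather than the ambient dimension.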