Weakly Coupled Deep Q-Networks

Ibrahim El Shar
Hitachi America Ltd., University of Pittsburgh
Sunnyvale, CA
ibrahim.elshar@hal.hitachi.com

Daniel Jiang
Meta, University of Pittsburgh
New York, NY
drjiang@meta.com

We propose weakly coupled deep Q-networks (WCDQN), a novel deep reinforcement learning algorithm that enhances performance in a class of structured problems called weakly coupled Markov decision processes (WCMDPs). WCMDPs consist of multiple independent subproblems connected by an action space constraint, which is a structural property that frequently emerges in practice. Despite this appealing structure, WCMDPs quickly become intractable as the number of subproblems grows. WCDQN employs a single network to train multiple DQN subagents, one for each subproblem, and then combines their solutions to establish an upper bound on the optimal action value. This guides the main DQN agent towards optimality. We show that the tabular version, weakly coupled Q-learning (WCQL), converges almost surely to the optimal action value. Numerical experiments show faster convergence compared to DQN and related techniques in settings with as many as 10 subproblems, $3^{10}$ total actions, and a continuous state space.

1 Introduction

Despite achieving many noteworthy and highly visible successes, it remains widely acknowledged that practical implementation of reinforcement learning (RL) is, in general, challenging [15]. This is particularly true in real-world settings where, unlike in simulated settings, interactions with the environment are costly to obtain. One promising path toward more sample-efficient learning in real-world situations is to incorporate known structural properties of the underlying Markov decision process (MDP) into the learning algorithm. As elegantly articulated by [44], structural properties can be considered a type of side information that can be exploited by the RL agent for its benefit. Instantiations of this concept are plentiful and diverse: examples include factored decompositions [33, 10, 47], latent or contextual MDPs [21, 39, 52], block MDPs [14], linear MDPs [32], shape-constrained value and/or policy functions [49, 37, 31], MDPs adhering to closure under policy improvement [8], and multi-timescale or hierarchical MDPs [23, 13], to name just a few.

In this paper, we focus on a class of problems called weakly coupled MDPs (WCMDPs) and show how one can leverage their inherent structure through a tailored RL approach. WCMDPs, often studied in the field of operations research, consist of multiple subproblems that are independent from each other except for a coupling constraint on the action space [24]. This type of weakly coupled structure frequently emerges in practice, spanning domains like supply chain management [24], recommender systems [65], online advertising [9], revenue management [53], and stochastic job scheduling [63]. Such MDPs can quickly become intractable when RL algorithms are applied naively, given that their state and action spaces grow exponentially with the number of subproblems [45]. One can compute an upper bound on the optimal value of a WCMDP by performing a Lagrangian relaxation on the action space coupling constraints. Importantly, the weakly coupled structure allows the relaxed problem to be completely decomposed across the subproblems, which are significantly easier to solve than the full MDP [24, 1].
Our goal in this paper is to devise a method that can integrate the Lagrangian relaxation upper bounds into the widely adopted value-based RL approaches of Q-learning [59] and deep Q-networks (DQN) [43]. Our proposed method is, to our knowledge, the first to explore the use of Lagrangian relaxations to tackle general WCMDPs in a fully model-free, deep RL setting.

Figure 1: An illustration of our RL approach for WCMDPs. Our approach takes a single full transition $\tau$ (as collected by a standard RL agent) and decomposes it into subproblem transitions $\tau_i$ that are passed to subagents, which are powered by a single network and aim to solve the easier subproblems. The results of these subagents are then used collectively to guide the main agent toward the optimal policy, whose actions need to satisfy linking constraints. Here, we illustrate the case of a single linking constraint that requires the sum of the actions to be bounded by a right-hand-side quantity $b(w)$, where $w$ is an exogenous state from the environment.

Main contributions. We make the following methodological and empirical contributions.
1. First, we propose a novel deep RL algorithm, called weakly coupled deep Q-networks (WCDQN), that exploits weakly coupled structure by using a set of subagents, each attached to one of the subproblems, whose solutions are combined to help improve the performance of the main DQN agent; see Figure 1 for a high-level overview.
2. Second, we also propose and analyze a tabular version of our algorithm called weakly coupled Q-learning (WCQL), which serves to conceptually motivate WCDQN. We show that WCQL converges almost surely to the optimal action value.
3. Finally, we conduct numerical experiments on a suite of realistic problems, including electric vehicle charging station optimization, multi-product inventory control, and online stochastic ad matching. The results show that our proposed algorithm outperforms baselines by a relatively large margin in settings with as many as 10 subproblems, $3^{10}$ total actions, and a continuous state space.

2 Related Literature

Weakly Coupled MDPs. This line of work began with [61] under the name of restless multi-armed bandits (RMAB), where there are two actions (active or passive) for each subproblem (also known as a project or an arm), under a budget constraint on the number of active arms at any given time.^1 As we will see soon, this is a special case of a WCMDP with two actions per subproblem and a single budget constraint. A popular solution approach to RMABs is the Whittle index policy, which was first proposed by [61] and uses the idea of ranking arms by their marginal productivity. The policy has been extensively studied in the literature from both applied and theoretical perspectives [20, 41, 29, 28, 42, 64]. Whittle conjectured in [61] that the Whittle index policy is asymptotically optimal under a condition called indexability; later, [60] established that asymptotic optimality requires not only indexability but also another technical condition, both of which are difficult to verify. As discussed in detail by [64], relying on the Whittle index policy in real-world problems can be problematic due to these hard-to-verify technical conditions (and if they are not met, computational robustness and the heuristic's original intuitive motivation may be lost).

^1 We note that WCMDPs should not be confused with constrained MDPs, where a budget constraint is imposed on the overall cost of the policy in all periods [2].
A number of recent papers have considered using RL in the setting of RMABs, but nearly all of them are based on Whittle indices [19, 46, 34, 35, 3, 50, 62] and are thus most useful when the indexability condition can be verified. Exceptions are [34] and [35], which propose to run RL directly on the Lagrangian relaxation of the true problem to obtain a Lagrange policy. Our paper is therefore closest in spirit to these two works, but our methods target the optimal value and policy (with the help of Lagrangian relaxation) rather than pursuing the Lagrange policy as the end goal (which does not have optimality guarantees in general). Moreover, compared to the other RL approaches mentioned above, we do not require the indexability condition and our method works for any WCMDP.

Relaxations of WCMDPs can be performed in several different ways, including approximate linear programming (ALP) [1, 11], network relaxation [45], and Lagrangian relaxation [61, 53, 24, 7, 1, 54, 11]. Notably, [1] provided key results for the ALP and Lagrangian relaxation approaches, and [11] gave theoretical justification for the closeness of the bounds obtained by the approximate linear programming and Lagrangian relaxation approaches, an empirical observation made in [1]. Our work focuses specifically on the Lagrangian relaxation approach, which relaxes the linking constraints on the action space by introducing a penalty in the objective.

DQN and Q-learning. The Q-learning algorithm [59] is perhaps the most popular value-based tabular RL algorithm [30, 55, 6], and the DQN approach of [43] extends the fundamental ideas behind Q-learning to the case where Q-functions are approximated using deep neural networks, famously demonstrated on a set of Atari games. Unfortunately, practical implementation of Q-learning, DQN, and their extensions on real-world problems can be difficult due to the large number of samples required for learning [44]. Various papers have attempted to extend and enhance the DQN algorithm. For example, to overcome the over-estimation problem and improve stability, [57] proposes double DQN, which adapts the tabular approach of double Q-learning from [22] to the deep RL setting; the main idea is to use a different network for the action selection and evaluation steps. [51] modifies the experience replay buffer sampling to prioritize certain tuples, and [58] adds a dueling architecture to double DQN that combines two components, an estimator for the state value function and an estimator for the state-dependent action advantage function. Other examples include bootstrapped DQN [48], amortized Q-learning [56], distributional RL [5], and rainbow DQN [26]. Our approach, WCDQN, is also an enhancement of DQN, but differs from the above works in that our focus is on modifying DQN to exploit the structure of a class of important problems that are otherwise intractable, while the existing papers focus on improvements to certain components of the DQN algorithm (e.g., network architecture, experience replay buffer, exploration strategy). In particular, it should be possible to integrate the main ideas of WCDQN into variations of DQN without much additional work.

Use of constraints and projections in RL. WCDQN relies on constraining the learned Q-function to satisfy a learned upper bound. The work of [25] uses a similar constrained optimization approach to enforce upper and lower bounds on the optimal action-value function in DQN.
Their bounds are derived by exploiting multistep returns of a general MDP, while ours are due to dynamically computed Lagrangian relaxations. [25] also does not provide any convergence guarantees for their approach. In addition, [16] proposed a convergent variant of Q-learning that leverages upper and lower bounds derived using the information relaxation technique of [12] to improve the performance of tabular Q-learning. Although our work shares the high-level idea of bounding Q-learning iterates, [16] focused on problems with partially known transition models (which are necessary for information relaxation) and the approach did not readily extend to the function approximation setting. Besides focusing on a different set of problems (WCMDPs), our proposed approach is model-free and naturally integrates with DQN.

3 Preliminaries

In this section, we give some background on WCMDPs, Q-learning, DQN, and the Lagrangian relaxation approach. All proofs throughout the rest of the paper are given in Appendix A.

3.1 Weakly Coupled MDPs

We study an infinite-horizon WCMDP with state space $\mathcal{S} = \mathcal{X} \times \mathcal{W}$ and finite action space $\mathcal{A}$, where $\mathcal{X}$ is the endogenous part (i.e., affected by the agent's actions) and $\mathcal{W}$ is the exogenous part (i.e., unaffected by the agent's actions) of the full state space. We use the general setup of WCMDPs from [11]. A WCMDP can be decomposed into $N$ subproblems. The state space of subproblem $i$ is denoted by $\mathcal{S}_i = \mathcal{X}_i \times \mathcal{W}$ and its action space is denoted by $\mathcal{A}_i$, such that $\mathcal{X} = \times_{i=1}^N \mathcal{X}_i$ and $\mathcal{A} = \times_{i=1}^N \mathcal{A}_i$. In each period, the decision maker observes an exogenously and independently evolving state $w \in \mathcal{W}$, along with the endogenous states $x = (x_1, x_2, \ldots, x_N)$, where $x_i \in \mathcal{X}_i$ is associated with subproblem $i$. Note that $w$ is shared by all of the subproblems, and this is reflected in the notation we use throughout the paper, where $s = (x, w) \in \mathcal{S}$ represents the full state and $s_i = (x_i, w)$ is the state of subproblem $i$.

In addition to the exogenous state $w$ being shared across subproblems, there also exist $L$ linking or coupling constraints that connect the subproblems: they take the form $\sum_{i=1}^N d(s_i, a_i) \le b(w)$, where $d(s_i, a_i), b(w) \in \mathbb{R}^L$ and $a_i \in \mathcal{A}_i$ is the component of the action associated with subproblem $i$. The set of feasible actions for state $s$ is given by
$\mathcal{A}(s) = \big\{ a \in \mathcal{A} : \textstyle\sum_{i=1}^N d(s_i, a_i) \le b(w) \big\}$. (1)

After observing the state $s = (x, w)$, the decision maker selects a feasible action $a \in \mathcal{A}(s)$. The transition probabilities for the endogenous component are denoted $p(x' \mid x, a)$, and we assume that transitions are conditionally independent across subproblems: $p(x' \mid x, a) = \prod_{i=1}^N p_i(x_i' \mid x_i, a_i)$, where $p_i(x_i' \mid x_i, a_i)$ are the transition probabilities for subproblem $i$. The exogenous state transitions according to $q(w' \mid w)$. Next, let $r_i(s_i, a_i)$ be the reward of subproblem $i$. The reward of the overall system is additive: $r(s, a) = \sum_{i=1}^N r_i(s_i, a_i)$.

Given a discount factor $\gamma \in [0, 1)$ and a feasible policy $\pi : \mathcal{S} \to \mathcal{A}$ that maps each state $s$ to a feasible action $a \in \mathcal{A}(s)$, the value (cumulative discounted reward) of following $\pi$ when starting in state $s$ and taking a first action $a$ is given by the action-value function $Q^\pi(s, a) = \mathbb{E}\big[ \sum_{t=0}^\infty \gamma^t r(s_t, a_t) \mid \pi, s_0 = s, a_0 = a \big]$. Our goal is to find an optimal policy $\pi^*$, i.e., one that maximizes $V^\pi(s) = Q^\pi(s, \pi(s))$. We let $Q^*(s, a) = \max_\pi Q^\pi(s, a)$ and $V^*(s) = \max_\pi V^\pi(s)$ be the optimal action-value and value functions, respectively. It is well known that the optimal policy selects actions in accordance with $\pi^*(s) = \arg\max_a Q^*(s, a)$ and that the Bellman recursion holds:
$Q^*(s, a) = r(s, a) + \gamma\, \mathbb{E}\big[ \max_{a' \in \mathcal{A}(s')} Q^*(s', a') \big]$, (2)
where $s' = (x', w')$ is distributed according to $p(\cdot \mid x, a)$ and $q(\cdot \mid w)$.
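To make the weakly coupled structure concrete, the following minimal sketch (illustrative only; all function and variable names are our own, not from the paper) enumerates the feasible action set $\mathcal{A}(s)$ of (1) by brute force for a small problem, which is practical only when the subproblem action sets are tiny.

```python
import itertools
import numpy as np

def feasible_actions(x, w, subproblem_actions, d, b):
    """Enumerate the feasible joint actions A(s) of Eq. (1) for a small WCMDP.

    x: tuple of endogenous subproblem states (x_1, ..., x_N)
    w: exogenous state shared by all subproblems
    subproblem_actions: list of per-subproblem action sets A_i
    d: callable, d(s_i, a_i) -> np.ndarray of shape (L,) (constraint usage)
    b: callable, b(w) -> np.ndarray of shape (L,) (constraint right-hand side)
    """
    rhs = b(w)
    feasible = []
    for a in itertools.product(*subproblem_actions):
        lhs = sum(d((x_i, w), a_i) for x_i, a_i in zip(x, a))
        if np.all(lhs <= rhs):              # sum_i d(s_i, a_i) <= b(w)
            feasible.append(a)
    return feasible

# Toy multi-action RMAB-style instance: d(s_i, a_i) = a_i, single constraint b(w) = 2.
acts = [(0, 1)] * 3                          # N = 3 subproblems, binary actions
A_s = feasible_actions((0, 0, 0), 0, acts,
                       d=lambda s_i, a_i: np.array([a_i]),
                       b=lambda w: np.array([2]))
print(A_s)                                   # every joint action with at most two active arms
```

Enumerating $\mathcal{A}$ scales exponentially in $N$, which is one reason the methods below only touch the joint action space through feasibility checks of this kind.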
3.2 Q-learning and DQN

The Q-learning algorithm of [59] is a tabular approach that attempts to learn the optimal action-value function $Q^*$ using stochastic approximation applied to (2). Using a learning rate $\alpha_n$, the update of the approximation $Q_n$ from iteration $n$ to $n+1$ is
$Q_{n+1}(s_n, a_n) = Q_n(s_n, a_n) + \alpha_n(s_n, a_n)\big[ y_n - Q_n(s_n, a_n) \big]$,
where $y_n = r_n + \gamma \max_{a'} Q_n(s_{n+1}, a')$ is the target value, computed using the observed reward $r_n$ at $(s_n, a_n)$, the transition to $s_{n+1}$, and the current value estimate $Q_n$.

The DQN approach of [43] approximates $Q^*$ via a neural network $Q(s, a; \theta)$ with network weights $\theta$. The loss function used to learn $\theta$ is directly based on minimizing the discrepancy between the two sides of (2):
$l(\theta) = \mathbb{E}_{s,a \sim \rho}\big[ \big( y - Q(s, a; \theta) \big)^2 \big]$,
where $y = r(s, a) + \gamma\, \mathbb{E}\big[ \max_{a'} Q(s', a'; \theta^-) \big]$, $\theta^-$ are frozen network weights from a previous iteration, and $\rho$ is a behavioral distribution [43]. In practice, we sample experience tuples $(s_n, a_n, r_n, s_{n+1})$ from a replay buffer and perform a stochastic gradient update:
$\theta_{n+1} = \theta_n + \alpha_n \big( y_n - Q(s_n, a_n; \theta) \big) \nabla_\theta Q(s_n, a_n; \theta)$,
with $y_n = r_n + \gamma \max_{a'} Q(s_{n+1}, a'; \theta^-)$. Note the resemblance of this update to that of Q-learning.

3.3 Lagrangian Relaxation

The Lagrangian relaxation approach decomposes WCMDPs by relaxing the linking constraints to obtain separate, easier-to-solve subproblems [1]. The main idea is to dualize the linking constraints $\sum_{i=1}^N d(s_i, a_i) \le b(w)$ using a penalty vector $\lambda \in \mathbb{R}^L_+$. The result is an augmented objective consisting of the original objective plus additional terms that penalize constraint violations. The Bellman equation of the relaxed version of the MDP in (2) is given by
$Q^\lambda(s, a) = r(s, a) + \lambda^\top \big[ b(w) - \textstyle\sum_{i=1}^N d(s_i, a_i) \big] + \gamma\, \mathbb{E}\big[ \max_{a' \in \mathcal{A}} Q^\lambda(s', a') \big]$. (3)
With the linking constraints removed, this relaxed MDP can be decomposed across subproblems, so we are able to define the following recursion for each subproblem $i$:
$Q_i^\lambda(s_i, a_i) = r_i(s_i, a_i) - \lambda^\top d(s_i, a_i) + \gamma\, \mathbb{E}\big[ \max_{a_i' \in \mathcal{A}_i} Q_i^\lambda(s_i', a_i') \big]$. (4)
It is well known from classical results that any penalty vector $\lambda \ge 0$ produces an MDP whose optimal value function is an upper bound on $V^*(s)$ [24, 1]. The upcoming proposition is a small extension of these results to the case of action-value functions, which is necessary for Q-learning.

Proposition 1. For any $\lambda \ge 0$ and $s \in \mathcal{S}$, it holds that $Q^*(s, a) \le Q^\lambda(s, a)$ for any $a \in \mathcal{A}(s)$. In addition, the Lagrangian action-value function of (3) satisfies
$Q^\lambda(s, a) = \lambda^\top B(w) + \sum_{i=1}^N Q_i^\lambda(s_i, a_i)$, (5)
where $Q_i^\lambda(s_i, a_i)$ is as defined in (4) and $B(w)$ satisfies the recursion
$B(w) = b(w) + \gamma\, \mathbb{E}\big[ B(w') \big]$, (6)
with the exogenous next state $w'$ distributed according to $q(\cdot \mid w)$.

The first part of the proposition is often referred to as weak duality, and the second part shows how the Lagrangian relaxation can be solved by decomposing it across subproblems, dramatically reducing the computational burden. The tightest upper bound is the solution of the Lagrangian dual problem, $Q^{\lambda^*}(s, a) = \min_{\lambda \ge 0} Q^\lambda(s, a)$, where $\lambda^*$ is the minimizer.
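Proposition 1 is what makes the relaxation computationally useful: each $Q_i^\lambda$ can be computed (or learned) on the small space $\mathcal{S}_i \times \mathcal{A}_i$ and then recombined. The sketch below is a model-based illustration with made-up names (the paper's setting is model-free, where these quantities are learned instead): it computes $Q_i^\lambda$ from (4) by value iteration, $B(w)$ from (6) in closed form, and the upper bound $Q^\lambda(s, a)$ via (5).

```python
import numpy as np

def subproblem_q(r_i, d_i, p_i, lam, gamma=0.9, iters=500):
    """Value iteration on the unconstrained subproblem recursion (4).

    r_i : array (S_i, A_i), subproblem rewards
    d_i : array (S_i, A_i, L), constraint usage d(s_i, a_i)
    p_i : array (S_i, A_i, S_i), subproblem transition probabilities
    lam : array (L,), Lagrange multiplier
    """
    S, A = r_i.shape
    Q = np.zeros((S, A))
    for _ in range(iters):
        V = Q.max(axis=1)                        # max_{a'} Q_i^lam(s', a')
        Q = r_i - d_i @ lam + gamma * p_i @ V    # Eq. (4)
    return Q

def exogenous_B(b, q_w, gamma=0.9):
    """Eq. (6): B(w) = b(w) + gamma * E[B(w')] for a finite exogenous chain.

    b   : array (|W|, L) of right-hand sides b(w)
    q_w : array (|W|, |W|), exogenous transition matrix q(w'|w)
    """
    return np.linalg.solve(np.eye(q_w.shape[0]) - gamma * q_w, b)

def lagrangian_upper_bound(Q_subs, s, a, w, lam, B):
    """Eq. (5): Q^lam(s, a) = lam^T B(w) + sum_i Q_i^lam(s_i, a_i).

    Q_subs: list of per-subproblem Q arrays; s, a: tuples of subproblem states/actions.
    """
    return float(lam @ B[w]) + sum(Q[s_i, a_i] for Q, s_i, a_i in zip(Q_subs, s, a))
```

By weak duality (the first part of Proposition 1), the returned value upper-bounds $Q^*(s, a)$ for any feasible $a$; minimizing it over a grid of multipliers, as WCQL does with the set $\Lambda$, tightens the bound.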
4 Weakly Coupled Q-learning

In this section, we introduce the tabular version of our RL algorithm, called weakly coupled Q-learning (WCQL), which will illustrate the main concepts of the deep RL version, WCDQN. We first state an assumption on when the linking constraint (1), which determines the feasible actions given a state, is observed.

4.1 Assumptions

Assumption 1 (Linking constraint observability; general setting). Suppose that upon landing in a state $s = (x, w)$, the agent observes the possible constraint left-hand-side values $d(s_i, \cdot)$ for every $i$, along with the constraint right-hand side $b(w) \in \mathbb{R}^L$.

Under Assumption 1, the agent is able to determine the feasible action set $\mathcal{A}(s)$ upon landing in state $s$. Accordingly, it can always take a feasible action. In many cases, it is known in advance that the feasible action set is of the multi-action RMAB form: there is a single linking constraint (i.e., $L = 1$) and the left-hand side is the sum of the subproblem actions (i.e., $d(s_i, a_i) = a_i$). In that case, Assumption 1 reduces to the following simpler statement, which we state for completeness.

Assumption 1′ (Linking constraint observability; multi-action RMAB setting). Suppose that we are in a multi-action RMAB setting. When the agent lands in a state $s = (x, w)$, it observes the constraint right-hand side $b(w) \in \mathbb{R}$.

In the numerical example applications of Section 6, for illustrative simplicity, we choose to focus on single-constraint settings where Assumption 1′ is applicable. Note that the difficulty of WCMDPs is largely determined by the number of subproblems and the size of the feasible set compared to the full action space, not necessarily by the number of linking constraints. In each of our example applications, Assumption 1′ naturally holds: for example, in the EV charging problem, there are a limited number of available charging stations (which is always observable).

An important part of WCQL is to track an estimate of $Q^{\lambda^*}(s, a)$, the result of the Lagrangian dual problem. To approximate this value, we replace the minimization over all $\lambda \ge 0$ by optimization over a finite set of possible multipliers $\Lambda$, which we consider as an input to our algorithm. In practice, we find that it is most straightforward to use multipliers of the form $\lambda = \bar\lambda \mathbf{1}$, where $\mathbf{1} \in \mathbb{R}^L$ is the all-ones vector and $\bar\lambda \in \mathbb{R}$, but from the algorithm's point of view, any set $\Lambda$ of nonnegative multipliers will do. We denote an experience tuple for the entire WCMDP by $\tau = (s, a, r, b, s')$. Similarly, we let $\tau_i = (s_i, a_i, r_i, s_i')$ be the experience relevant to subproblem $i$, as described in (4). Note that $b$ is excluded from $\tau_i$ because it does not enter the subproblem Bellman recursion.

4.2 Algorithm Description

The WCQL algorithm can be decomposed into three main steps.

Subproblems and subagents. First, for each subproblem $i \in \{1, 2, \ldots, N\}$ and every $\lambda \in \Lambda$, we attempt to learn an approximation of $Q_i^\lambda(s_i, a_i)$ from (4), which are the Q-values of the unconstrained subproblem associated with $\lambda$. We do this by running an instance of Q-learning with learning rate $\beta_n$. Letting $Q_{i,n}^\lambda$ be the estimate at iteration $n$, the update is given by
$Q_{i,n+1}^\lambda(s_i, a_i) = Q_{i,n}^\lambda(s_i, a_i) + \beta_n(s_i, a_i)\big[ y_{i,n}^\lambda - Q_{i,n}^\lambda(s_i, a_i) \big]$, (7)
where the target value is defined as $y_{i,n}^\lambda = r_i(s_i, a_i) - \lambda^\top d(s_i, a_i) + \gamma \max_{a_i'} Q_{i,n}^\lambda(s_i', a_i')$. Note that although we are running several Q-learning instances, they all make use of a common experience tuple $\tau$ split across subproblems, where subproblem $i$ receives the portion $\tau_i$. We remind the reader that each subproblem is dramatically simpler than the full MDP, since it operates on the smaller state and action spaces ($\mathcal{S}_i$ and $\mathcal{A}_i$) instead of $\mathcal{S}$ and $\mathcal{A}$. We refer to the subproblem Q-learning instances as subagents. Therefore, each subagent is associated with a subproblem $i$ and a penalty $\lambda \in \Lambda$ and aims to learn $Q_i^\lambda$.
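A tabular subagent is just a small Q-learning instance keyed by $(i, \lambda)$. The sketch below uses an illustrative class and names (not the paper's code), and a constant learning rate for simplicity, whereas the experiments use visit-count-based rates; it implements the update (7) for one subproblem transition $\tau_i$.

```python
import numpy as np
from collections import defaultdict

class Subagent:
    """Tabular Q-learning on the unconstrained subproblem for a fixed multiplier lam."""

    def __init__(self, n_actions, lam, gamma=0.99, beta=0.1):
        self.Q = defaultdict(lambda: np.zeros(n_actions))   # Q_i^lam(s_i, a_i)
        self.lam, self.gamma, self.beta = np.asarray(lam), gamma, beta

    def update(self, s_i, a_i, r_i, d_i, s_i_next):
        # Target of Eq. (7): y = r_i - lam^T d(s_i, a_i) + gamma * max_a' Q_i^lam(s_i', a')
        y = r_i - float(self.lam @ np.asarray(d_i)) + self.gamma * self.Q[s_i_next].max()
        self.Q[s_i][a_i] += self.beta * (y - self.Q[s_i][a_i])
```

In WCQL, one such table is kept for every pair $(i, \lambda) \in \{1, \ldots, N\} \times \Lambda$, and all of them are updated from the single shared transition $\tau$.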
Learning the Lagrangian bounds. Next, at the level of the main agent, we combine the approximations $Q_{i,n+1}^\lambda$ learned by the subagents to form an estimate of the Lagrangian action-value function $Q^\lambda$, as defined in (5). To do so, we first estimate the quantity $B(w)$ of Proposition 1. This can be done using a stochastic approximation step with a learning rate $\eta_n$, as follows:
$B_{n+1}(w) = B_n(w) + \eta_n(w)\big[ b(w) + \gamma B_n(w') - B_n(w) \big]$, (8)
where we recall that $w$ and $w'$ come from the experience tuple $\tau$, embedded within $s$ and $s'$. Now, using Proposition 1, we approximate $Q^\lambda(s, a)$ using
$Q_{n+1}^\lambda(s, a) = \lambda^\top B_{n+1}(w) + \sum_{i=1}^N Q_{i,n+1}^\lambda(s_i, a_i)$. (9)
Finally, we estimate an upper bound on $Q^*$ by taking the minimum over $\Lambda$:
$Q_{n+1}^{\lambda^*}(s, a) = \min_{\lambda \in \Lambda} Q_{n+1}^\lambda(s, a)$. (10)

Q-learning guided by Lagrangian bounds. We would now like to make use of the learned upper bound $Q_{n+1}^{\lambda^*}(s, a)$ when performing Q-learning on the full problem. Denote the WCQL estimate of $Q^*$ at iteration $n$ by $Q'_n$. We first make a standard update towards an intermediate value $Q_{n+1}$ using learning rate $\alpha_n$:
$Q_{n+1}(s, a) = Q'_n(s, a) + \alpha_n(s, a)\big[ y_n - Q'_n(s, a) \big]$, (11)
where $y_n = r(s, a) + \gamma \max_{a' \in \mathcal{A}(s')} Q'_n(s', a')$. To incorporate the bounds that we previously estimated, we then project $Q_{n+1}(s, a)$ to satisfy the estimated upper bound:
$Q'_{n+1}(s, a) = Q_{n+1}^{\lambda^*}(s, a) \wedge Q_{n+1}(s, a)$, (12)
where $a \wedge b = \min\{a, b\}$. The agent now takes an action in the environment using a behavioral policy, such as the ϵ-greedy policy based on $Q'_{n+1}(s, a)$.

The motivation behind this projection is as follows: since the subproblems are significantly smaller in terms of state and action spaces compared to the main problem, the subagents are expected to quickly converge. As a result, our upper bound estimates will get better, improving the action-value estimate of the main Q-learning agent through the projection step. In addition, WCQL can enable a sort of generalization to unseen states by leveraging the weakly coupled structure. The following example illustrates this piece of intuition.

Example 1. Suppose a WCMDP has $N = 3$ subproblems with $\mathcal{S}_i = \{1, 2, 3\}$ and $\mathcal{A}_i = \{1, 2\}$ for each $i$, leading to $3^3 \cdot 2^3 = 216$ total state-action pairs. For the sake of illustration, suppose that the agent has visited the states $s = (1, 1, 1)$, $s = (2, 2, 2)$, and $s = (3, 3, 3)$, and from each of these states has taken both of the actions $a = (1, 1, 1)$ and $a = (2, 2, 2)$. This means that from the perspective of every subproblem $i$, the agent has visited all state-action pairs in $\mathcal{S}_i \times \mathcal{A}_i$, which is enough information to produce an estimate of $Q_i^\lambda(s_i, a_i)$ for all $(s_i, a_i)$ and, interestingly, an estimate of $Q^{\lambda^*}(s, a)$ for every $(s, a)$, despite having visited only a small fraction (6/216) of the possible state-action pairs. This allows the main Q-learning agent to make use of upper bound information at every state-action pair via the projection step (12). The main intuition is that these upper bound values are likely to be more sensible than a randomly initialized value, and therefore, can aid learning.

The above example is certainly contrived, but hopefully illustrates the benefits of decomposition and subsequent projection. We note that, especially in settings where the limiting factor is the ability to collect enough experience, one can trade off extra computation to derive these bounds and improve RL performance without the need to collect additional experience. The full pseudo-code of the WCQL algorithm is available in Appendix B.
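Putting the pieces of Section 4.2 together, the combine-and-project steps (8)–(12) can be summarized by the following sketch (hypothetical names; `Q_main` and the subagent tables are assumed to be defaultdict-style containers, `feasible_actions` is an assumed helper returning $\mathcal{A}(s)$, actions are assumed zero-indexed, and the state is assumed to be stored as the tuple $(x_1, \ldots, x_N, w)$).

```python
import numpy as np

def wcql_step(Q_main, Q_subs, B, tau, Lambda, feasible_actions,
              alpha=0.1, eta=0.1, gamma=0.99):
    """One WCQL main-agent update for a transition tau = (s, a, r, b, s').

    Q_main : defaultdict(float) mapping (s, a) -> projected estimate Q'_n(s, a)
    Q_subs : dict mapping (i, lam) -> per-state action-value arrays, e.g. Subagent.Q
             from the sketch above (assumed already updated via Eq. (7))
    B      : dict mapping w -> np.ndarray of shape (L,), assumed initialized (e.g., zeros)
    Lambda : list of multiplier tuples
    """
    s, a, r, b, s_next = tau
    w, w_next = s[-1], s_next[-1]                     # exogenous component of the state

    # Eq. (8): stochastic approximation update for B(w)
    B[w] = B[w] + eta * (np.asarray(b) + gamma * B[w_next] - B[w])

    # Eqs. (9)-(10): combine subagents, then minimize over the multiplier set
    def upper_bound(state, action):
        vals = []
        for lam in Lambda:
            q_sum = sum(Q_subs[(i, lam)][s_i][a_i]
                        for i, (s_i, a_i) in enumerate(zip(state[:-1], action)))
            vals.append(float(np.dot(lam, B[state[-1]])) + q_sum)
        return min(vals)

    # Eq. (11): standard Q-learning target over the feasible actions A(s')
    y = r + gamma * max(Q_main[(s_next, a_next)] for a_next in feasible_actions(s_next))
    q_new = Q_main[(s, a)] + alpha * (y - Q_main[(s, a)])

    # Eq. (12): project onto the learned Lagrangian upper bound
    Q_main[(s, a)] = min(q_new, upper_bound(s, a))
```

The behavioral action for the next step would then be chosen ϵ-greedily from $\mathcal{A}(s')$ using the projected table, exactly as in Algorithm 2.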
4.3 Convergence Analysis

In this section, we show that WCQL converges to $Q^*$ with probability one. First, we state a standard assumption on learning rates and state visitation.

Assumption 2. We assume the following: (i) $\sum_{n=0}^\infty \alpha_n(s, a) = \infty$ and $\sum_{n=0}^\infty \alpha_n^2(s, a) < \infty$ for all $(s, a) \in \mathcal{S} \times \mathcal{A}$; (ii) analogous conditions to (i) hold for $\{\beta_n(s_i, a_i)\}_n$ and $\{\eta_n(w)\}_n$; and (iii) the behavioral policy is such that all state-action pairs $(s, a)$ are visited infinitely often w.p.1.

Theorem 1. Under Assumptions 1 and 2, the following statements hold with probability one.
(i) For each $i$ and $\lambda \in \Lambda$, $\lim_{n\to\infty} Q_{i,n}^\lambda(s_i, a_i) = Q_i^\lambda(s_i, a_i)$ for all $(s_i, a_i) \in \mathcal{S}_i \times \mathcal{A}_i$.
(ii) For each $\lambda \in \Lambda$, $\lim_{n\to\infty} Q_n^\lambda(s, a) \ge Q^*(s, a)$ for all $(s, a) \in \mathcal{S} \times \mathcal{A}$.
(iii) $\lim_{n\to\infty} Q'_n(s, a) = Q^*(s, a)$ for all $(s, a) \in \mathcal{S} \times \mathcal{A}$.

Theorem 1 ensures that each subagent's value function converges to the subproblem optimal value. Furthermore, it shows that, asymptotically, the Lagrangian action-value function given by (9) will be an upper bound on the optimal action-value function $Q^*$ of the full problem and that our algorithm will converge to $Q^*$.

5 Weakly Coupled DQN

In this section, we propose our main algorithm, weakly coupled DQN (WCDQN), which integrates the main idea of WCQL into a function approximation setting. WCDQN guides DQN using Lagrangian relaxation bounds, implemented using a constrained optimization approach.

Networks. Analogous to WCQL, WCDQN has a main network $Q'(s, a; \theta)$ that learns the action value of the full problem. In addition to the main network, WCDQN uses a subagent network $Q_i^\lambda(s_i, a_i; \theta^U)$ to learn the subproblem action-value functions $Q_i^\lambda$. As in standard DQN, we also have $\theta^-$ and $\theta^{U-}$, which are versions of $\theta$ and $\theta^U$ frozen from a previous iteration and used for computing target values [43]. The inputs to the subagent network are $(i, \lambda, s_i, a_i)$, meaning that we can use a single network to learn the action-value function for all subproblems and all $\lambda \in \Lambda$ simultaneously. The Lagrangian upper bound and the best upper bound estimates are:
$Q^\lambda(s, a; \theta^{U-}) = \lambda^\top B(w) + \sum_{i=1}^N Q_i^\lambda(s_i, a_i; \theta^{U-})$ and $Q^{\lambda^*}(s, a; \theta^{U-}) = \min_{\lambda \in \Lambda} Q^\lambda(s, a; \theta^{U-})$. (13)

Loss functions. Before diving into the training process, we describe the loss functions used to train each network, as they are instructive toward understanding the main idea behind WCDQN (and how it differs from standard DQN). Consider a behavioral distribution $\rho$ for state-action pairs and a distribution $\mu$ over the multipliers $\Lambda$. The subagent network is trained with the loss
$l_U(\theta^U) = \mathbb{E}_{s,a \sim \rho,\, \lambda \sim \mu}\Big[ \textstyle\sum_{i=1}^N \big( y_i^\lambda - Q_i^\lambda(s_i, a_i; \theta^U) \big)^2 \Big]$, (14)
where the (ideal) target value is
$y_i^\lambda = r_i(s_i, a_i) - \lambda^\top d(s_i, a_i) + \gamma\, \mathbb{E}\big[ \max_{a_i' \in \mathcal{A}_i} Q_i^\lambda(s_i', a_i'; \theta^{U-}) \big]$. (15)
For the main agent, we propose a loss function that adds a soft penalty for violating the upper bound:
$l(\theta) = \mathbb{E}_{s,a \sim \rho,\, \lambda \sim \mu}\Big[ \big( y - Q'(s, a; \theta) \big)^2 + \kappa_U \big( ( Q'(s, a; \theta) - y_U )_+ \big)^2 \Big]$, (16)
where $(\cdot)_+ = \max(\cdot, 0)$, $\kappa_U$ is a coefficient for the soft penalty, and
$y = r(s, a) + \gamma\, \mathbb{E}\big[ \max_{a' \in \mathcal{A}(s')} Q'(s', a'; \theta^-) \big]$, (17)
$y_U = r(s, a) + \gamma\, \mathbb{E}\big[ \max_{a' \in \mathcal{A}} Q^{\lambda^*}(s', a'; \theta^{U-}) \big]$. (18)
The penalty encourages the network to satisfy the bounds obtained from the Lagrangian relaxation.

Training process. The training process resembles DQN, with a few modifications. At any iteration, we first take an action using an ϵ-greedy policy (based on the main network) over the feasible actions, store the obtained transition experience $\tau$ in the buffer, and update the estimate of $B(w)$ following (8).^2 Each network is then updated by taking a stochastic gradient descent step on its associated loss function, where the expectations are approximated by sampling minibatches of experience tuples $\tau$ and multipliers $\lambda$. The penalty coefficient $\kappa_U$ can either be held constant at a positive value or annealed using a schedule throughout the training. The full details are shown in Algorithm 1 and some further details are given in Appendix C.

^2 Here we use a tabular representation for $B(w)$ since our example applications do not necessarily have a large exogenous space $\mathcal{W}$. When required, WCDQN can be extended to use function approximation (i.e., neural networks) to learn $B(w)$.
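For concreteness, the following PyTorch-style sketch shows how the main-agent loss (16) with the soft upper-bound penalty might look. This is a sketch under assumptions, not the authors' implementation: `q_net` and `q_target` are assumed callables returning per-action values, `feasible_mask` encodes $\mathcal{A}(s')$, and the upper-bound targets `y_U` are assumed to be precomputed from the subagent target network via (13) and (18).

```python
import torch
import torch.nn.functional as F

def wcdqn_main_loss(q_net, q_target, y_U, batch, kappa_U, gamma):
    """Squared Bellman error plus a soft penalty for exceeding the Lagrangian bound, cf. Eq. (16)."""
    s, a, r, s_next, feasible_mask = batch                  # feasible_mask: bool, (batch, |A|)

    with torch.no_grad():
        q_next = q_target(s_next)                            # (batch, |A|)
        q_next = q_next.masked_fill(~feasible_mask, -1e9)    # restrict the max to A(s')
        y = r + gamma * q_next.max(dim=1).values             # Eq. (17)

    q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)     # Q'(s, a; theta)
    bellman = F.mse_loss(q_sa, y)
    overshoot = torch.clamp(q_sa - y_U, min=0.0)             # (Q'(s, a; theta) - y_U)_+
    return bellman + kappa_U * (overshoot ** 2).mean()
```

Setting `kappa_U = 0` recovers (up to the feasibility mask) the standard DQN loss, which is one way to see WCDQN as a drop-in modification of DQN.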
Algorithm 1 Weakly Coupled DQN
1: Input: Lagrange multiplier set $\Lambda$ and a distribution $\mu$ over $\Lambda$, penalty coefficient $\kappa_U$, initialized replay buffer $\mathcal{D}$, target network update frequency $C_{\text{target}}$, initial state distribution $S_0$.
2: Initialize the main Q-network $Q'(\cdot, \cdot; \theta)$ and the subagent network $Q_i^\lambda(\cdot, \cdot; \theta^U)$.
3: Initialize target network weights $\theta^- = \theta$ and $\theta^{U-} = \theta^U$.
4: for $n = 0, 1, 2, \ldots$ do
5:   Take an ϵ-greedy behavioral action $a_n$ with respect to the main network $Q'(s_n, a; \theta)$.
6:   Store the observed transition $\tau_n$ into the replay buffer $\mathcal{D}$.
7:   // Update subagents and combine results to estimate upper bound
8:   Sample a minibatch of transitions $\tau$ from $\mathcal{D}$ along with a sample of $\lambda$ from $\mu$.
9:   for $i = 1, 2, \ldots, N$ do
10:    Compute targets $y_i^\lambda$ using (15) and take an optimization step on the subagent loss (14).
11:  end for
12:  Update the right-hand-side estimate $B_{n+1}(w_n)$ according to (8).
13:  Using (13), combine subproblems to obtain the Lagrangian upper bound $Q^{\lambda^*}(s, a; \theta^{U-})$.
14:  // Main agent update with upper bound penalty loss
15:  Compute $y$ and $y_U$ using (17) and (18), then take an optimization step on the main loss (16).
16:  Update $\theta^- = \theta$ and $\theta^{U-} = \theta^U$ every $C_{\text{target}}$ steps.
17: end for

6 Numerical Experiments

In this section, we evaluate our algorithms on three different WCMDPs. First, we evaluate WCQL on an electric vehicle (EV) deadline scheduling problem with multiple charging spots and compare its performance with several other tabular algorithms: Q-learning (QL), double Q-learning (Double-QL) [22], speedy Q-learning (SQL) [4], bias-corrected Q-learning (BCQL) [40], and Lagrange policy Q-learning (Lagrangian QL) [34]. We then evaluate WCDQN on two problems, multi-product inventory control and online stochastic ad matching, and compare against standard DQN, Double DQN, and the optimality-tightening DQN (OTDQN) algorithm^3 of He et al. [25] as baselines. Further details on the environment and algorithmic parameters are in Appendix D.

^3 We include OTDQN as a baseline because it also makes use of constrained optimization during training.

EV charging deadline scheduling [63]. In this problem, a decision maker is responsible for charging electric vehicles (EVs) at a charging service center that consists of $N = 3$ charging spots. An EV enters the system when a charging spot is available and announces the amount of electricity it needs to be charged, denoted $B_t$, along with the time that it will leave the system, denoted $D_t$. The decision maker also faces exogenous, random Markovian processing costs $c_t$. At each period, the action is to decide which EVs to charge in accordance with the period's capacity constraint. For each unit of power provided to an EV, the service center receives a reward of $1 - c_t$. However, if the EV leaves the system with an unfulfilled charge, a penalty is assessed. The goal is to maximize the revenue minus penalty costs.

Multi-product inventory control with an exogenous production rate [27]. Consider the problem of resource allocation for a facility that manufactures $N = 10$ products. Each product $i$ has an independent exogenous demand given by $D_i$, $i = 1, \ldots, N$. To meet these demands, the products are made to stock.
Limited storage $R_i$ is available for each product, and holding a unit of inventory per period incurs a cost $h_i$. Unmet demand is backordered at a cost $b_i$ if the number of backorders is less than the maximum number of allowable backorders $M_i$. Otherwise, it is lost with a penalty cost $l_i$. The decision maker needs to allocate a resource level $a_i \in \{0, 1, \ldots, U\}$ for product $i$ from a shared finite resource quantity $U$ in response to changes in the stock level of each product, denoted by $x_i \in \mathcal{X}_i = \{-M_i, -M_i + 1, \ldots, R_i\}$. A negative stock level corresponds to the number of backorders. Allocating a resource level $a_i$ yields a production rate given by a function $\rho_i(a_i, p_i)$, where $p_i$ is an exogenous Markovian noise that affects the production rate. The goal is to minimize the total cost, which consists of holding, back-ordering, and lost sales costs.

Online stochastic ad matching [18]. We study the problem of matching $N = 6$ advertisers to arriving impressions. In each period, an impression of type $e_t$ arrives according to a Markov chain. An action $a_{t,i} \in \{0, 1\}$ assigns impression $e_t$ to advertiser $i$, with a constraint that exactly one advertiser is selected: $\sum_{i=1}^N a_{t,i} = 1$. Advertiser states represent the number of remaining ads to display and evolve according to $s_{t+1,i} = s_{t,i} - a_{t,i}$. The objective is to maximize the discounted sum of expected rewards for all advertisers.

In Figure 2, we show how WCQL's projection method helps it learn a more accurate Q-function more quickly than competing tabular methods. The first panel, Figure 2A, shows an example evolution of WCQL's projected value function $Q'$, along with the evolution of the upper bound. We compare this to the evolution of the action-value function in the absence of the projection step. In the second panel, Figure 2B, we plot the relative error between the learned value functions of various algorithms compared to the optimal value function. Both plots are from the EV charging example. Detailed descriptions of the results are given in the figure's caption.

Figure 2: Behavior of WCQL's learning process compared to other tabular methods. (A) WCQL learning process illustration. (B) Relative error comparison. In Figure 2A, we show an example of the evolution of quantities associated with WCQL for a randomly selected state-action pair in the EV charging problem: upper bound (blue), WCQL Q-value (orange line, with x marks indicating the points projected by the bound), standard Q-learning (green), and the optimal action-value (red). Note that the orange line sometimes appears above the blue line since WCQL's projection step is asynchronous, i.e., the projections are made only if the state is visited by the behavioral policy. Notice that at the last marker, the bound moved the Q-value in orange to a good value, relatively near the optimal value. WCQL's Q-value then evolves on its own and eventually converges. On the other hand, standard QL (green), which follows the same behavior policy as WCQL, is still far from optimal. In Figure 2B, we show the relative error, defined as $\|V_n - V^*\|_2 / \|V^*\|_2$, where $V_n$ and $V^*$ are the state value functions derived from the Q-iterate at iteration $n$ and $Q^*$, respectively. WCQL exhibits the steepest decline in relative error compared to the other algorithms. Note that since Lagrangian QL acts based on the Q-values of the subproblems, there is no Q-value for this algorithm on which to compute a relative error.

The results of our numerical experiments are shown in Figure 3. We see that in both the tabular and the function approximation cases, our algorithms outperformed the baselines, with WCQL and WCDQN achieving the best mean episode total rewards amongst all problems. From Figure 3A, we see that although the difference between WCQL and Lagrangian QL is small towards the end of the training process, there are stark differences earlier on. In particular, the performance curve of WCQL shows significantly lower variance, suggesting more robustness across random seeds. Given that WCQL and Lagrangian QL differ only in the projection step, we can attribute the improved stability to the guidance provided by the Lagrangian bounds. Figure 3B shows that for the multi-product inventory problem, the OTDQN, DQN, and Double DQN baselines show extremely noisy performance compared to WCDQN, whose significantly better and more stable performance is likely due to the use of faster converging subagents and better use of the collected experience. Similarly, in the online stochastic ad matching problem, WCDQN significantly outperforms all the baselines.

Figure 3: Benchmarking results for WCQL (EV charging) and WCDQN (multi-product inventory control, online ad matching) against baseline methods. (A) EV charging. (B) Multi-product inventory control. (C) Online ad matching. The plots show mean total rewards and their 95% confidence intervals across 5 independent replications.

7 Conclusion

In this study, we propose the WCQL algorithm for learning in weakly coupled MDPs and we show that our algorithm converges to the optimal action-value function. We then propose WCDQN, which extends the idea behind the WCQL algorithm to the function approximation case. Our algorithms are model-free and learn upper bounds on the optimal action value using a combination of Lagrangian relaxation and Q-learning. These bounds are then used within a constrained optimization approach to improve performance and make learning more efficient. Our approaches significantly outperform competing approaches on several benchmark environments.

Acknowledgments and Disclosure of Funding

This research was supported in part by the University of Pittsburgh Center for Research Computing, RRID:SCR_022735, through the resources provided. Specifically, this work used the H2P cluster, which is supported by NSF award number OAC-2117681.

References

[1] Daniel Adelman and Adam J Mersereau. Relaxations of weakly coupled stochastic dynamic programs. Operations Research, 56(3):712–727, 2008.
[2] Eitan Altman. Constrained Markov decision processes. Routledge, 2021.
[3] Konstantin E Avrachenkov and Vivek S Borkar. Whittle index based Q-learning for restless bandits with average reward. Automatica, 139:110186, 2022.
[4] M. G. Azar, R. Munos, M. Ghavamzadaeh, and H. J. Kappen. Speedy Q-learning. In Advances in Neural Information Processing Systems 24, 2011.
[5] Marc G Bellemare, Will Dabney, and Mark Rowland. Distributional reinforcement learning. MIT Press, 2023.
[6] D. P. Bertsekas and J. N. Tsitsiklis. Neuro-dynamic programming, volume 5. Athena Scientific, Belmont, MA, 1996.
[7] Dimitris Bertsimas and Adam J Mersereau. A learning approach for interactive marketing to a customer segment. Operations Research, 55(6):1120–1135, 2007.
[8] Jalaj Bhandari and Daniel Russo. Global optimality guarantees for policy gradient methods. arXiv preprint arXiv:1906.01786, 2019.
[9] Craig Boutilier and Tyler Lu. Budget allocation using weakly coupled, constrained Markov decision processes. 2016.
[10] Craig Boutilier, Richard Dearden, and Moisés Goldszmidt. Stochastic dynamic programming with factored representations. Artificial Intelligence, 121(1-2):49–107, 2000.
[11] David B Brown and Jingwei Zhang. On the strength of relaxations of weakly coupled stochastic dynamic programs. Operations Research, 2022.
[12] David B Brown, James E Smith, and Peng Sun. Information relaxations and duality in stochastic dynamic programs. Operations Research, 58(4-part-1):785–801, 2010.
[13] Hyeong Soo Chang, Pedram Jaefari Fard, Steven I Marcus, and Mark Shayman. Multitime scale Markov decision processes. IEEE Transactions on Automatic Control, 48(6):976–987, 2003.
[14] Simon Du, Akshay Krishnamurthy, Nan Jiang, Alekh Agarwal, Miroslav Dudik, and John Langford. Provably efficient RL with rich observations via latent state decoding. In International Conference on Machine Learning, pages 1665–1674. PMLR, 2019.
[15] Gabriel Dulac-Arnold, Nir Levine, Daniel J Mankowitz, Jerry Li, Cosmin Paduraru, Sven Gowal, and Todd Hester. Challenges of real-world reinforcement learning: definitions, benchmarks and analysis. Machine Learning, 110(9):2419–2468, 2021.
[16] Ibrahim El Shar and Daniel Jiang. Lookahead-bounded Q-learning. In International Conference on Machine Learning, pages 8665–8675. PMLR, 2020.
[17] Eyal Even-Dar, Yishay Mansour, and Peter Bartlett. Learning rates for Q-learning. Journal of Machine Learning Research, 5(1), 2003.
[18] Jon Feldman, Aranyak Mehta, Vahab Mirrokni, and Shan Muthukrishnan. Online stochastic matching: Beating 1-1/e. In 2009 50th Annual IEEE Symposium on Foundations of Computer Science, pages 117–126. IEEE, 2009.
[19] Jing Fu, Yoni Nazarathy, Sarat Moka, and Peter G Taylor. Towards Q-learning the Whittle index for restless bandits. In 2019 Australian & New Zealand Control Conference (ANZCC), pages 249–254. IEEE, 2019.
[20] Kevin D Glazebrook, Diego Ruiz-Hernandez, and Christopher Kirkbride. Some indexable families of restless bandit problems. Advances in Applied Probability, 38(3):643–672, 2006.
[21] Assaf Hallak, Dotan Di Castro, and Shie Mannor. Contextual Markov decision processes. arXiv preprint arXiv:1502.02259, 2015.
[22] Hado Hasselt. Double Q-learning. Advances in Neural Information Processing Systems, 23, 2010.
[23] Milos Hauskrecht, Nicolas Meuleau, Leslie Pack Kaelbling, Thomas Dean, and Craig Boutilier. Hierarchical solution of Markov decision processes using macro-actions. In Uncertainty in Artificial Intelligence, pages 220–229, 1998.
[24] Jeffrey Thomas Hawkins. A Lagrangian decomposition approach to weakly coupled dynamic optimization problems and its applications. PhD thesis, Massachusetts Institute of Technology, 2003.
[25] Frank S He, Yang Liu, Alexander G Schwing, and Jian Peng. Learning to play in a day: Faster deep reinforcement learning by optimality tightening. arXiv preprint arXiv:1611.01606, 2016.
[26] Matteo Hessel, Joseph Modayil, Hado Van Hasselt, Tom Schaul, Georg Ostrovski, Will Dabney, Dan Horgan, Bilal Piot, Mohammad Azar, and David Silver. Rainbow: Combining improvements in deep reinforcement learning. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 32, 2018.
[27] David J Hodge and Kevin D Glazebrook. Dynamic resource allocation in a multi-product make-to-stock production system. Queueing Systems, 67(4):333–364, 2011.
[28] Yu-Pin Hsu. Age of information: Whittle index for scheduling stochastic arrivals. In 2018 IEEE International Symposium on Information Theory (ISIT), pages 2634–2638. IEEE, 2018.
[29] Weici Hu and Peter Frazier. An asymptotically optimal index policy for finite-horizon restless bandits. arXiv preprint arXiv:1707.00205, 2017.
[30] T. Jaakkola, M. I. Jordan, and S. P. Singh. Convergence of stochastic iterative dynamic programming algorithms. In Advances in Neural Information Processing Systems, pages 703–710, 1994.
[31] Daniel R Jiang and Warren B Powell. An approximate dynamic programming algorithm for monotone value functions. Operations Research, 63(6):1489–1511, 2015.
[32] Chi Jin, Zhuoran Yang, Zhaoran Wang, and Michael I Jordan. Provably efficient reinforcement learning with linear function approximation. In Conference on Learning Theory, pages 2137–2143. PMLR, 2020.
[33] Michael Kearns and Daphne Koller. Efficient reinforcement learning in factored MDPs. In IJCAI, volume 16, pages 740–747, 1999.
[34] Jackson A Killian, Arpita Biswas, Sanket Shah, and Milind Tambe. Q-learning Lagrange policies for multi-action restless bandits. In Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery & Data Mining, pages 871–881, 2021.
[35] Jackson A Killian, Lily Xu, Arpita Biswas, and Milind Tambe. Robust restless bandits: Tackling interval uncertainty with deep reinforcement learning. arXiv preprint arXiv:2107.01689, 2021.
[36] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
[37] Sumit Kunnumkal and Huseyin Topaloglu. Using stochastic approximation methods to compute optimal base-stock levels in inventory control problems. Operations Research, 56(3):646–664, 2008.
[38] Harold Kushner and G George Yin. Stochastic approximation and recursive algorithms and applications, volume 35. Springer Science & Business Media, 2003.
[39] Jeongyeol Kwon, Yonathan Efroni, Constantine Caramanis, and Shie Mannor. RL for latent MDPs: Regret guarantees and a lower bound. Advances in Neural Information Processing Systems, 34:24523–24534, 2021.
[40] D. Lee and W. B. Powell. Bias-corrected Q-learning with multistate extension. IEEE Transactions on Automatic Control, 2019.
[41] Keqin Liu and Qing Zhao. Indexability of restless bandit problems and optimality of Whittle index for dynamic multichannel access. IEEE Transactions on Information Theory, 56(11):5547–5567, 2010.
[42] Rahul Meshram, D Manjunath, and Aditya Gopalan. On the Whittle index for restless multiarmed hidden Markov bandits. IEEE Transactions on Automatic Control, 63(9):3046–3053, 2018.
[43] V. Mnih, K. Kavukcuoglu, D. Silver, A. Graves, I. Antonoglou, D. Wierstra, and M. Riedmiller. Playing Atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602, 2013.
[44] Aditya Mohan, Amy Zhang, and Marius Lindauer. Structure in reinforcement learning: A survey and open problems. arXiv preprint arXiv:2306.16021, 2023.
[45] Selvaprabu Nadarajah and Andre Augusto Cire. Self-adapting network relaxations for weakly coupled Markov decision processes. Available at SSRN, 2021.
[46] Khaled Nakhleh, Santosh Ganji, Ping-Chun Hsieh, I-Hong Hou, Srinivas Shakkottai, et al. NeurWIN: Neural Whittle index network for restless bandits via deep RL. Advances in Neural Information Processing Systems, 34:828–839, 2021.
[47] Ian Osband and Benjamin Van Roy. Near-optimal reinforcement learning in factored MDPs. Advances in Neural Information Processing Systems, 27, 2014.
[48] Ian Osband, Charles Blundell, Alexander Pritzel, and Benjamin Van Roy. Deep exploration via bootstrapped DQN. Advances in Neural Information Processing Systems, 29, 2016.
[49] Warren Powell, Andrzej Ruszczyński, and Huseyin Topaloglu. Learning algorithms for separable approximations of discrete stochastic optimization problems. Mathematics of Operations Research, 29(4):814–836, 2004.
[50] Francisco Robledo, Vivek Borkar, Urtzi Ayesta, and Konstantin Avrachenkov. QWI: Q-learning with Whittle index. ACM SIGMETRICS Performance Evaluation Review, 49(2):47–50, 2022.
[51] Tom Schaul, John Quan, Ioannis Antonoglou, and David Silver. Prioritized experience replay. arXiv preprint arXiv:1511.05952, 2015.
[52] Lauren N Steimle, David L Kaufman, and Brian T Denton. Multi-model Markov decision processes. IISE Transactions, 53(10):1124–1139, 2021.
[53] Kalyan Talluri and Garrett Van Ryzin. An analysis of bid-price controls for network revenue management. Management Science, 44(11-part-1):1577–1593, 1998.
[54] Huseyin Topaloglu. Using Lagrangian relaxation to compute capacity-dependent bid prices in network revenue management. Operations Research, 57(3):637–649, 2009.
[55] J. N. Tsitsiklis. Asynchronous stochastic approximation and Q-learning. Machine Learning, 16(3):185–202, 1994.
[56] Tom Van de Wiele, David Warde-Farley, Andriy Mnih, and Volodymyr Mnih. Q-learning in enormous action spaces via amortized approximate maximization. arXiv preprint arXiv:2001.08116, 2020.
[57] Hado Van Hasselt, Arthur Guez, and David Silver. Deep reinforcement learning with double Q-learning. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 30, 2016.
[58] Ziyu Wang, Tom Schaul, Matteo Hessel, Hado Hasselt, Marc Lanctot, and Nando Freitas. Dueling network architectures for deep reinforcement learning. In International Conference on Machine Learning, pages 1995–2003. PMLR, 2016.
[59] C. J. C. H. Watkins. Learning from Delayed Rewards. PhD thesis, King's College, Cambridge, UK, 1989.
[60] Richard R Weber and Gideon Weiss. On an index policy for restless bandits. Journal of Applied Probability, 27(3):637–648, 1990.
[61] Peter Whittle. Restless bandits: Activity allocation in a changing world. Journal of Applied Probability, 25(A):287–298, 1988.
[62] Guojun Xiong, Shufan Wang, and Jian Li. Learning infinite-horizon average-reward restless multi-action bandits via index awareness. Advances in Neural Information Processing Systems, 35:17911–17925, 2022.
[63] Zhe Yu, Yunjian Xu, and Lang Tong. Deadline scheduling as restless bandits. IEEE Transactions on Automatic Control, 63(8):2343–2358, 2018.
[64] Xiangyu Zhang and Peter I Frazier. Near-optimality for infinite-horizon restless bandits with many arms. arXiv preprint arXiv:2203.15853, 2022.
[65] Jiahong Zhou, Shunhui Mao, Guoliang Yang, Bo Tang, Qianlong Xie, Lebin Lin, Xingxing Wang, and Dong Wang. RL-MPCA: A reinforcement learning based multi-phase computation allocation approach for recommender systems. In Proceedings of the ACM Web Conference 2023, pages 3214–3224, 2023.

Appendix to Weakly Coupled Deep Q-Networks

A.1 Proof of Proposition 1

Proof. We prove the first part of the proposition (weak duality) by induction.
First, define $Q_0^*(s, a) = r(s, a)$ and $Q_0^\lambda(s, a) = r(s, a) + \lambda^\top \big[ b(w) - \sum_{i=1}^N d(s_i, a_i) \big]$, and suppose we run value iteration for both systems:
$Q_{t+1}^*(s, a) = r(s, a) + \gamma\, \mathbb{E}\big[ \max_{a' \in \mathcal{A}(s')} Q_t^*(s', a') \big]$,
$Q_{t+1}^\lambda(s, a) = r(s, a) + \lambda^\top \big[ b(w) - \textstyle\sum_{i=1}^N d(s_i, a_i) \big] + \gamma\, \mathbb{E}\big[ \max_{a' \in \mathcal{A}} Q_t^\lambda(s', a') \big]$.
It is well known that, by the convergence of value iteration, $Q^*(s, a) = \lim_{t\to\infty} Q_t^*(s, a)$ and $Q^\lambda(s, a) = \lim_{t\to\infty} Q_t^\lambda(s, a)$.

Consider a state $s \in \mathcal{S}$ and a feasible action $a \in \mathcal{A}(s)$. We have
$Q_0^\lambda(s, a) = r(s, a) + \lambda^\top \big[ b(w) - \textstyle\sum_{i=1}^N d(s_i, a_i) \big] \ge r(s, a) = Q_0^*(s, a)$.
Suppose $Q_t^\lambda(s, a) \ge Q_t^*(s, a)$ holds for all $s \in \mathcal{S}$ and $a \in \mathcal{A}(s)$ for some $t > 0$ (induction hypothesis). Then,
$Q_{t+1}^\lambda(s, a) = r(s, a) + \lambda^\top \big[ b(w) - \textstyle\sum_{i=1}^N d(s_i, a_i) \big] + \gamma\, \mathbb{E}\big[ \max_{a' \in \mathcal{A}} Q_t^\lambda(s', a') \big]
\ge r(s, a) + \lambda^\top \big[ b(w) - \textstyle\sum_{i=1}^N d(s_i, a_i) \big] + \gamma\, \mathbb{E}\big[ \max_{a' \in \mathcal{A}(s')} Q_t^*(s', a') \big]
\ge r(s, a) + \gamma\, \mathbb{E}\big[ \max_{a' \in \mathcal{A}(s')} Q_t^*(s', a') \big] = Q_{t+1}^*(s, a)$.
Thus, it follows that $Q^\lambda(s, a) \ge Q^*(s, a)$.

For the proof of the second part of the proposition, define $B_0(w) = b(w)$ and $B_{t+1}(w) = b(w) + \gamma\, \mathbb{E}[B_t(w')]$. We use an induction proof. We have, for all $(s, a) \in \mathcal{S} \times \mathcal{A}$,
$Q_0^\lambda(s, a) = r(s, a) + \lambda^\top \big[ b(w) - \textstyle\sum_{i=1}^N d(s_i, a_i) \big] = \sum_{i=1}^N \big[ r_i(s_i, a_i) - \lambda^\top d(s_i, a_i) \big] + \lambda^\top b(w) = \sum_{i=1}^N Q_{0,i}^\lambda(s_i, a_i) + \lambda^\top B_0(w)$,
where $Q_{0,i}^\lambda(s_i, a_i) = r_i(s_i, a_i) - \lambda^\top d(s_i, a_i)$. Similarly, for all $(s, a) \in \mathcal{S} \times \mathcal{A}$,
$Q_1^\lambda(s, a) = r(s, a) + \lambda^\top \big[ b(w) - \textstyle\sum_{i=1}^N d(s_i, a_i) \big] + \gamma\, \mathbb{E}\big[ \max_{a' \in \mathcal{A}} Q_0^\lambda(s', a') \big]
= \sum_{i=1}^N \big[ r_i(s_i, a_i) - \lambda^\top d(s_i, a_i) \big] + \lambda^\top b(w) + \gamma\, \mathbb{E}\Big[ \max_{a' \in \mathcal{A}} \textstyle\sum_{i=1}^N Q_{0,i}^\lambda(s_i', a_i') + \lambda^\top B_0(w') \Big]
= \sum_{i=1}^N \Big[ r_i(s_i, a_i) - \lambda^\top d(s_i, a_i) + \gamma\, \mathbb{E}\big[ \max_{a_i' \in \mathcal{A}_i} Q_{0,i}^\lambda(s_i', a_i') \big] \Big] + \lambda^\top \big( b(w) + \gamma\, \mathbb{E}[B_0(w')] \big)
= \sum_{i=1}^N Q_{1,i}^\lambda(s_i, a_i) + \lambda^\top B_1(w)$.
Continuing in this manner, we arrive at $Q_t^\lambda(s, a) = \sum_{i=1}^N Q_{t,i}^\lambda(s_i, a_i) + \lambda^\top B_t(w)$. Finally, we have
$Q^\lambda(s, a) = \lim_{t\to\infty} Q_t^\lambda(s, a) = \lim_{t\to\infty} \Big[ \textstyle\sum_{i=1}^N Q_{t,i}^\lambda(s_i, a_i) + \lambda^\top B_t(w) \Big] = \sum_{i=1}^N Q_i^\lambda(s_i, a_i) + \lambda^\top B(w)$,
which follows by the convergence of value iteration.

A.2 Proof of Theorem 1

Proof. First, we define the Bellman operator $H$ by $(HQ')(s, a) = r(s, a) + \gamma\, \mathbb{E}\big[ \max_{a'} Q'(s', a') \big]$, which is known to be a $\gamma$-contraction mapping. Next, we define the random noise term
$\xi_n(s, a) = \gamma \max_{a'} Q'_n(s', a') - \gamma\, \mathbb{E}\big[ \max_{a'} Q'_n(s', a') \big]$. (19)
Analogously, for subproblem $i \in \{1, \ldots, N\}$, define the subproblem Bellman operator $(H_i Q_i^\lambda)(s_i, a_i) = r_i(s_i, a_i) - \lambda^\top d(s_i, a_i) + \gamma\, \mathbb{E}\big[ \max_{a_i'} Q_i^\lambda(s_i', a_i') \big]$ and the random noise term
$\xi_{i,n}(s_i, a_i) = \gamma \max_{a_i'} Q_{i,n}^\lambda(s_i', a_i') - \gamma\, \mathbb{E}\big[ \max_{a_i'} Q_{i,n}^\lambda(s_i', a_i') \big]$. (20)
The update rules of WCQL can then be written as
$Q_{n+1}(s, a) = (1 - \alpha_n(s, a))\, Q'_n(s, a) + \alpha_n(s, a)\big[ (HQ'_n)(s, a) + \xi_{n+1}(s, a) \big]$,
$Q_{i,n+1}^\lambda(s_i, a_i) = (1 - \beta_n(s_i, a_i))\, Q_{i,n}^\lambda(s_i, a_i) + \beta_n(s_i, a_i)\big[ (H_i Q_{i,n}^\lambda)(s_i, a_i) + \xi_{i,n+1}(s_i, a_i) \big]$, (21)
$Q_{n+1}^{\lambda^*}(s, a) = \min_{\lambda \in \Lambda} \Big[ \lambda^\top B_n(w) + \textstyle\sum_{i=1}^N Q_{i,n}^\lambda(s_i, a_i) \Big]$,
$Q'_{n+1}(s, a) = \min\big( Q_{n+1}(s, a),\, Q_{n+1}^{\lambda^*}(s, a) \big)$. (22)

Parts (i) and (ii). By the iteration described in (21), we know that for a fixed $\lambda$, we are running Q-learning on an auxiliary MDP with Bellman operator $H_i$, which encodes a reward $r_i(s_i, a_i) - \lambda^\top d(s_i, a_i)$ and the transition dynamics for subproblem $i$. By the standard result for the asymptotic convergence of Q-learning [6], we have
$\lim_{n\to\infty} Q_{i,n}^\lambda(s_i, a_i) = Q_i^\lambda(s_i, a_i)$. (23)
We now prove the result in (ii): $\lim_{n\to\infty} Q_n^\lambda(s, a) \ge Q^*(s, a)$. Recall that $Q_n^\lambda(s, a) = \lambda^\top B_n(w) + \sum_{i=1}^N Q_{i,n}^\lambda(s_i, a_i)$. By standard stochastic approximation theory, $\lim_{n\to\infty} B_n(w) = B(w)$ for all $w$ [38]. Combining this with (23), we have $\lim_{n\to\infty} Q_n^\lambda(s, a) = Q^\lambda(s, a)$ for all $(s, a)$, and to conclude that this limit is an upper bound on $Q^*(s, a)$, we apply Proposition 1.

Part (iii). Assume without loss of generality that $Q^*(s, a) = 0$ for all state-action pairs $(s, a)$. This can be established by shifting the origin of the coordinate system. We also assume that $\alpha_n(s, a) \le 1$ for all $(s, a)$ and $n$. We proceed via induction.
Note that the iterates $Q'_n(s, a)$ are bounded in the sense that there exists a constant $D_0 = R_{\max}/(1 - \gamma)$, with $R_{\max} = \max_{(s,a)} |r(s, a)|$, such that $|Q'_n(s, a)| \le D_0$ for all $(s, a)$ and iterations $n$ [17]. Define the sequence $D_{k+1} = (\gamma + \epsilon) D_k$, where $\epsilon > 0$ is such that $\gamma + \epsilon < 1$. Clearly, $D_k \to 0$. Suppose that there exists a random variable $n_k$, representing an iteration threshold, such that for all $(s, a)$,
$-D_k \le Q'_n(s, a) \le \min\{ D_k,\, Q_n^{\lambda^*}(s, a) \}, \quad \forall\, n \ge n_k$.
We will show that there exists some iteration $n_{k+1}$ such that
$-D_{k+1} \le Q'_n(s, a) \le \min\{ D_{k+1},\, Q_n^{\lambda^*}(s, a) \} \quad \forall\, (s, a),\ n \ge n_{k+1}$,
which implies that $Q'_n(s, a)$ converges to $Q^*(s, a) = 0$ for all $(s, a)$.

By part (ii), we know that for all $\eta > 0$, with probability 1, there exists some finite iteration $n_0$ such that for all $n \ge n_0$,
$Q^*(s, a) - \eta \le Q_n^{\lambda^*}(s, a)$. (24)

Now, we define an accumulated noise process started at $n_k$ by $W_{n_k, n_k}(s, a) = 0$ and
$W_{n+1, n_k}(s, a) = (1 - \alpha_n(s, a))\, W_{n, n_k}(s, a) + \alpha_n(s, a)\, \xi_{n+1}(s, a), \quad n \ge n_k$, (25)
where $\xi_n(s, a)$ is as defined in (19). Let $\mathcal{F}_n$ be the entire history of the algorithm up to the point where the step sizes at iteration $n$ are selected. We use Corollary 4.1 in [6], which states that under Assumption 2 on the step size $\alpha_n(s, a)$, if $\mathbb{E}[\xi_n(s, a) \mid \mathcal{F}_n] = 0$ and $\mathbb{E}[\xi_n^2(s, a) \mid \mathcal{F}_n] \le A_n$, where the random variable $A_n$ is bounded with probability 1, then the sequence $W_{n+1, n_k}(s, a)$ defined in (25) converges to zero with probability 1. From our definition of the stochastic approximation noise $\xi_n(s, a)$ in (19), we have $\mathbb{E}[\xi_n(s, a) \mid \mathcal{F}_n] = 0$ and $\mathbb{E}[\xi_n^2(s, a) \mid \mathcal{F}_n] \le C\big( 1 + \max_{s', a'} Q_n'^2(s', a') \big)$, where $C$ is a constant. It then follows that
$\lim_{n\to\infty} W_{n, n_k}(s, a) = 0 \quad \forall\, (s, a),\, n_k$.
We use the following lemma from [6] to bound the accumulated noise.

Lemma A.1 (Lemma 4.2 in [6]). For every $\delta > 0$, with probability one, there exists some $n'$ such that $|W_{n, n'}(s, a)| \le \delta$ for all $n \ge n'$.

Now, by Lemma A.1, let $n_k' \ge \max(n_k, n_0)$ be such that, for all $n \ge n_k'$, we have $|W_{n, n_k'}(s, a)| \le \gamma \epsilon D_k < \gamma D_k$. Let $\nu_k \ge n_k'$ be such that, for all $n \ge \nu_k$, by (24) we have $-\gamma \epsilon D_k - \gamma D_k \le Q_n^{\lambda^*}(s, a)$. Define another sequence $Y_n$ that starts at iteration $\nu_k$: $Y_{\nu_k}(s, a) = D_k$ and
$Y_{n+1}(s, a) = (1 - \alpha_n(s, a))\, Y_n(s, a) + \alpha_n(s, a)\, \gamma D_k$. (26)
It is easy to show that the sequence $Y_n(s, a)$ in (26) is decreasing, bounded below by $\gamma D_k$, and converges to $\gamma D_k$ as $n \to \infty$. Now we state the following lemma.

Lemma A.2. For all state-action pairs $(s, a)$ and iterations $n \ge \nu_k$, it holds that
$-Y_n(s, a) + W_{n, \nu_k}(s, a) \le Q'_n(s, a) \le \min\{ Q_n^{\lambda^*}(s, a),\, Y_n(s, a) + W_{n, \nu_k}(s, a) \}$. (27)

Proof. We focus on the right-hand-side inequality; the left-hand side can be proved similarly. For the base case $n = \nu_k$, the statement holds because $Y_{\nu_k}(s, a) = D_k$ and $W_{\nu_k, \nu_k}(s, a) = 0$. We assume it is true for $n$ and show that it continues to hold for $n + 1$:
$Q_{n+1}(s, a) = (1 - \alpha_n(s, a))\, Q'_n(s, a) + \alpha_n(s, a)\big[ (HQ'_n)(s, a) + \xi_{n+1}(s, a) \big]
\le (1 - \alpha_n(s, a)) \min\{ Q_n^{\lambda^*}(s, a),\, Y_n(s, a) + W_{n, \nu_k}(s, a) \} + \alpha_n(s, a)\, (HQ'_n)(s, a) + \alpha_n(s, a)\, \xi_{n+1}(s, a)
\le (1 - \alpha_n(s, a)) \big( Y_n(s, a) + W_{n, \nu_k}(s, a) \big) + \alpha_n(s, a)\, \gamma D_k + \alpha_n(s, a)\, \xi_{n+1}(s, a)
= Y_{n+1}(s, a) + W_{n+1, \nu_k}(s, a)$,
where we used $\|HQ'_n\| \le \gamma \|Q'_n\| \le \gamma D_k$. Now, we have
$Q'_{n+1}(s, a) = \min\big( Q_{n+1}^{\lambda^*}(s, a),\, Q_{n+1}(s, a) \big) \le \min\{ Q_{n+1}^{\lambda^*}(s, a),\, Y_{n+1}(s, a) + W_{n+1, \nu_k}(s, a) \}$.
The inequality holds because $Q_{n+1}(s, a) \le Y_{n+1}(s, a) + W_{n+1, \nu_k}(s, a)$, which completes the proof.

Since $Y_n(s, a) \to \gamma D_k$ and $W_{n, \nu_k}(s, a) \to 0$, we have $\limsup_n Q'_n \le \gamma D_k < D_{k+1}$. Therefore, there exists some time $n_{k+1}$ such that
$-D_{k+1} \le Q'_n(s, a) \le \min\{ D_{k+1},\, Q_n^{\lambda^*}(s, a) \} \quad \forall\, (s, a),\ n \ge n_{k+1}$,
which completes the induction.

B Weakly Coupled Q-learning Algorithm Description

Algorithm 2 Weakly Coupled Q-learning
1: Input: Lagrange multiplier set $\Lambda$, initial state distribution $S_0$.
2: Initialize Q-table estimates $Q_0$ and $\{Q_{0,i}\}_{i=1}^N$. Set $Q'_0 = Q_0$.
3: for $n = 0, 1, 2, \ldots$ do
4:   Take an ϵ-greedy behavioral action $a_n$ with respect to $Q'_n(s_n, a)$.
5:   // Estimate upper bound by combining subagents
6:   for $i = 1, 2, \ldots, N$ do
7:     Update each subproblem value function $Q^\lambda_{i,n+1}$ according to (7).
8:   end for
9:   Update the right-hand-side estimate $B_{n+1}(w_n)$ according to (8).
10:  Using (9) and (10), combine subproblems to obtain $Q^{\lambda^*}_{n+1}(s_n, a)$ for all $a \in \mathcal{A}(s_n)$.
11:  // Main agent standard update, followed by projection
12:  Do the standard Q-learning update (11) to obtain $Q_{n+1}$.
13:  Perform the upper bound projection step: $Q'_{n+1}(s, a) = Q^{\lambda^*}_{n+1}(s, a) \wedge Q_{n+1}(s, a)$.
14: end for

C Weakly Coupled DQN Algorithm Implementation

In our implementation of WCDQN, the subproblem $Q_i^\lambda$-network in Algorithm 1 follows the standard network architecture of [43], where, given an input state $s_i$, the network predicts the Q-values for all actions. This mandates that all the subproblems have the same number of actions. To address different subproblem action spaces, we can change the network architecture to receive the state-action pair $(s_i, a_i)$ as input and output the predicted Q-value. This simple change does not interfere with or affect WCDQN's main idea. Our code is available at https://github.com/ibrahim-elshar/WCDQN_NeurIPS.

D Numerical Experiment Details

A discount factor of 0.9 is used for the EV charging problem and 0.99 for the multi-product inventory and online stochastic ad matching problems. In the tabular setting, we use a polynomial learning rate that depends on state-action visitation counts, given by $\alpha_n(s, a) = 1/\nu_n(s, a)^r$, where $\nu_n(s, a)$ represents the number of times $(s, a)$ has been visited up to iteration $n$, and $r = 0.4$. We also use an ϵ-greedy exploration policy, given by $\epsilon(s) = 1/\nu(s)^e$, where $\nu(s)$ is the number of times state $s$ has been visited. We set $e = 0.4$. In the function approximation setting, we use an ϵ-greedy policy that decays ϵ from 1 to 0.05 over 30,000 steps. All state-action value functions are initialized randomly. Experiments were run on a shared-memory cluster with dual 12-core Skylake CPUs (Intel Xeon Gold 6126, 2.60 GHz) and 192 GB RAM per node.

D.1 EV charging deadline scheduling [63]

In this problem, there are in total $N = 3$ charging spots. Each spot represents a subproblem with state $(c_t, B_{t,i}, D_{t,i})$, where $c_t \in \{0.2, 0.5, 0.8\}$ is the exogenous electric cost, $B_{t,i} \le 2$ is the amount of charge required, and $D_{t,i} \le 4$ is the remaining time until the EV leaves the system. The state space size is 36 for each subproblem. At a given period $t$, the action of each subproblem is whether to charge an EV occupying the charging spot ($a_{t,i} = 1$) or not ($a_{t,i} = 0$). A feasible action satisfies $\sum_{i=1}^N a_{t,i} \le b(c_t)$, where $b(0.2) = 3$, $b(0.5) = 2$, and $b(0.8) = 1$. The reward of each subproblem is given by
$r_i\big( (c_t, B_{t,i}, D_{t,i}), a_{t,i} \big) = \begin{cases} (1 - c_t)\, a_{t,i} & \text{if } B_{t,i} > 0,\ D_{t,i} > 1, \\ (1 - c_t)\, a_{t,i} - F(B_{t,i} - a_{t,i}) & \text{if } B_{t,i} > 0,\ D_{t,i} = 1, \\ 0 & \text{otherwise}, \end{cases}$
where $F(B_{t,i} - a_{t,i}) = 0.2\,(B_{t,i} - a_{t,i})^2$ is a penalty function for failing to complete charging the EV before the deadline. The endogenous state of each subproblem evolves such that $(B_{t+1,i}, D_{t+1,i}) = (B_{t,i} - a_{t,i},\, D_{t,i} - 1)$ if $D_{t,i} > 1$, and $(B_{t+1,i}, D_{t+1,i}) = (B, D)$ with probability $q(B, D)$ if $D_{t,i} \le 1$, where $q(0, 0) = 0.3$ and $q(B, D) = 0.7/11$ for all $B > 0$ and $D > 0$. The exogenous cost $c_t$ evolves according to the transition probabilities
$q(c_{t+1} \mid c_t) = \begin{pmatrix} 0.4 & 0.3 & 0.3 \\ 0.2 & 0.5 & 0.3 \\ 0.6 & 0.2 & 0.2 \end{pmatrix}.$
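As a small worked example of the environment just described, the per-spot reward can be written as follows (a sketch that mirrors the piecewise reward reconstructed above; names are our own).

```python
def ev_subproblem_reward(c_t, B_ti, D_ti, a_ti, penalty_coef=0.2):
    """Per-spot EV charging reward, following the piecewise definition in Appendix D.1.

    c_t  : exogenous electricity cost in {0.2, 0.5, 0.8}
    B_ti : remaining charge requested by the EV at spot i
    D_ti : remaining time before the EV at spot i departs
    a_ti : 1 if the spot charges this period, 0 otherwise
    """
    if B_ti > 0 and D_ti > 1:
        return (1.0 - c_t) * a_ti
    if B_ti > 0 and D_ti == 1:
        # EV departs next period: penalize any unfulfilled charge, F(B - a) = 0.2 (B - a)^2
        return (1.0 - c_t) * a_ti - penalty_coef * (B_ti - a_ti) ** 2
    return 0.0
```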
D.2 Multi-product inventory control with an exogenous production rate [27]

We consider manufacturing N = 10 products. The exogenous demand $D_{t,i}$ for each product $i \in \{1, 2, \ldots, 10\}$ follows a Poisson distribution with mean $\mu_i$. The maximum storage capacity and the maximum number of allowable backorders (after which lost-sales costs are incurred) for product $i$ are $R_i$ and $M_i$, respectively. The state of subproblem $i$ is $(x_{t,i}, p_t)$, where $x_{t,i} \in \mathcal{X}_i = \{-M_i, -M_i + 1, \ldots, R_i\}$ is the inventory level of product $i$ and $p_t$ is an exogenous, Markovian noise with support $[0.8, 1]$. A negative stock level corresponds to the number of backorders. For subproblem $i$, the action $a_{t,i}$ is the number of resources allocated to product $i$. The maximum number of resources available across all products is $U = 3$, so feasible actions must satisfy $\sum_{i=1}^N a_{t,i} \le U$. Allocating a resource level $a_{t,i}$ yields a production rate $\rho_i(a_{t,i}, p_t) = 12\, p_t\, a_{t,i} / (5.971 + a_{t,i})$. The cost function for product $i$, $c_i(x_{t,i}, p_t, a_{t,i})$, is the sum of the holding, backorder, and lost-sales costs. Letting $h_i$, $b_i$, and $l_i$ denote the per-unit holding, backorder, and lost-sales costs, respectively, the cost function is given by
\[
c_i(x_{t,i}, p_t, a_{t,i}) = h_i\big(x_{t,i} + \rho_i(a_{t,i}, p_t)\big)^+ + b_i\big(-x_{t,i} - \rho_i(a_{t,i}, p_t)\big)^+ + l_i\Big(\big(D_{t,i} - x_{t,i} - \rho_i(a_{t,i}, p_t)\big)^+ - M_i\Big)^+,
\]
where $(\cdot)^+ = \max(\cdot, 0)$. We summarize the cost parameters and the mean demand of each product in Table 1. Finally, the inventory state of subproblem $i$ transitions according to
\[
x_{t+1,i} = \max\big(\min(x_{t,i} + \rho_i(a_{t,i}, p_t) - D_{t,i},\, R_i),\, -M_i\big),
\]
where the exogenous noise $p_t$ evolves according to a transition matrix sampled from a Dirichlet distribution whose parameters are each sampled (once per replication) from a Uniform(1, 5) distribution.

Table 1: Multi-product inventory environment parameters

Product i                 1      2      3      4      5      6      7      8      9      10
Storage capacity R_i      20     30     10     15     10     10     25     30     15     10
Maximum backorders M_i    5      5      5      5      5      5      5      5      5      5
Mean demand mu_i          0.3    0.7    0.5    1.0    1.4    0.9    1.1    1.2    0.3    0.6
Holding cost h_i          0.1    0.2    0.05   0.3    0.2    0.5    0.3    0.4    0.15   0.12
Backorder cost b_i        3.0    1.2    5.15   1.3    1.1    1.1    10.3   1.05   1.0    3.1
Lost sales cost l_i       30.1   3.3    10.05  3.9    3.7    3.6    40.3   4.5    12.55  44.1
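As a concrete reading of the cost expression above, here is a minimal Python sketch of the per-product cost and inventory transition. The function names are our own, and we read the production rate exactly as it appears in the text, $\rho_i(a, p) = 12\, p\, a / (5.971 + a)$.

```python
def pos(z):
    """Positive part (z)^+ = max(z, 0)."""
    return max(z, 0.0)


def production_rate(a, p):
    """Production rate rho_i(a, p) = 12 * p * a / (5.971 + a) from Appendix D.2."""
    return 12.0 * p * a / (5.971 + a)


def product_cost(x, p, a, demand, h, b, l, M):
    """Holding + backorder + lost-sales cost for a single product."""
    rho = production_rate(a, p)
    holding = h * pos(x + rho)
    backorder = b * pos(-x - rho)
    lost_sales = l * pos(pos(demand - x - rho) - M)
    return holding + backorder + lost_sales


def inventory_transition(x, p, a, demand, R, M):
    """Next inventory level, clipped to the interval [-M, R]."""
    return max(min(x + production_rate(a, p) - demand, R), -M)


# Example with product 1's parameters from Table 1 (h=0.1, b=3.0, l=30.1, R=20, M=5):
cost = product_cost(x=-2, p=0.9, a=1, demand=1, h=0.1, b=3.0, l=30.1, M=5)
next_x = inventory_transition(x=-2, p=0.9, a=1, demand=1, R=20, M=5)
```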
D.3 Online stochastic ad matching [18]

In this problem, a platform needs to match N = 6 advertisers to arriving impressions [18]. An impression $e_t \in E = \{1, 2, \ldots, 5\}$ arrives according to a discrete-time Markov chain with transition probabilities $q(e_{t+1} \mid e_t)$, where each row of the transition matrix $q$ is sampled from a Dirichlet distribution whose parameters are sampled (once per replication) from Uniform(1, 20). The action $a_{t,i} \in \{0, 1\}$ indicates whether impression $e_t$ is assigned to advertiser $i$. The platform can assign an impression to at most one advertiser: $\sum_{i=1}^N a_{t,i} = 1$. The state of advertiser $i$, $x_{t,i}$, gives the number of remaining ads to display and evolves according to $x_{t+1,i} = x_{t,i} - a_{t,i}$. The initial state is $x_0 = (10, 11, 12, 10, 14, 9)$. The reward obtained from advertiser $i$ in state $s_{t,i} = (x_{t,i}, e_t)$ is $r_i(s_{t,i}, a_{t,i}) = l_{i,e_t} \min(x_{t,i}, a_{t,i})$, where the parameters $l_{i,e_t}$ are sampled (once per replication) from Uniform(1, 4).

D.4 Training parameters

Each method was trained for 6,000 episodes on the EV charging problem, 5,000 episodes on the multi-product inventory control problem, and 10,000 episodes on the online stochastic ad matching problem. The episode lengths for the EV charging, online stochastic ad matching, and multi-product inventory control problems are 50, 30, and 25, respectively. We performed 5 independent replications. For all algorithms, we use a neural network with two hidden layers of 64 and 32 units, respectively, with a rectified linear unit (ReLU) activation after each hidden layer. The Adam optimizer [36] with a learning rate of $1.0 \times 10^{-4}$ was used. For OTDQN, we use the same parameter settings as in He et al. [25]. For WCDQN, we use Lagrange multipliers $\lambda \in [0, 10]$, discretized with a step size of 0.01. We also used an experience replay buffer of size 100,000, initialized with 10,000 experience tuples collected by a random policy. For WCDQN, we set the penalty coefficient $\kappa_U$ to 10, after a small amount of manual hyperparameter tuning over the set $\{1, 2, 4, 10\}$.

D.5 Sensitivity analysis of WCQL with respect to the number of subproblems

We study the performance improvement of WCQL over vanilla Q-learning as the number of subproblems increases in the EV charging problem. We vary only the number of subproblems (from 2 to 5) and keep all other settings as defined in Appendix D.1. The results, given in Table 2, show that the benefit of WCQL grows as the number of subproblems increases. This provides additional evidence for the practicality of our approach, especially in regimes where standard methods struggle.

Table 2: Cumulative reward and percent improvement of WCQL over QL on the EV charging problem with different numbers of subproblems.

                         Number of subproblems
Algorithm                2        3        4        5
QL                       5.39     6.70     5.20     3.26
WCQL                     5.35     7.14     6.28     4.66
Percent improvement     -0.7%     6.6%     20.8%    42.9%

E Limitations and Future Work

One interesting direction for future work is to address the limitation of learning the Lagrangian upper bound using a fixed, finite set $\Lambda$. Instead, one could learn the optimal value of $\lambda$ and concentrate the computational effort on learning the Lagrangian upper bound for that particular $\lambda$, which could potentially lead to tighter bounds. A possible approach is to apply subgradient descent on $\lambda$, similar to what is done in Hawkins [24]; a minimal sketch of such an update is given below.
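The following is a minimal, illustrative sketch (not part of WCDQN) of a projected subgradient update on a scalar multiplier $\lambda \ge 0$ for a single linking constraint $\sum_i a_{t,i} \le b(w_t)$. It assumes access to an estimate of the (discounted) constraint slack at the subproblems' Lagrangian-greedy actions, which serves as a subgradient of the dual upper bound; all names, signatures, and the slack estimator are our own assumptions.

```python
def projected_subgradient_step(lam, slack, step_size):
    """One step of projected subgradient descent on the dual variable lambda >= 0.

    For a relaxed constraint sum_i a_i <= b, the dual (upper-bound) objective has
    subgradient b - sum_i a_i at the subproblems' greedy actions.  Minimizing the
    dual therefore means stepping in the negative subgradient direction and
    projecting back onto [0, +inf)."""
    return max(lam - step_size * slack, 0.0)


# Illustrative usage: positive slack (constraint loose) pushes lambda down,
# negative slack (relaxed solution violates the constraint) pushes lambda up.
lam = 1.0
for t, slack in enumerate([2.0, -1.0, 0.5, -0.5]):
    lam = projected_subgradient_step(lam, slack, step_size=0.1 / (1 + t))
```

In a fully model-free setting, the slack estimate could be formed from the subagents' greedy actions on sampled transitions; how best to do this alongside the WCDQN updates is left to future work.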