# A Boosting Approach to Reinforcement Learning

Nataly Brukhim (Princeton University, nbrukhim@cs.princeton.edu) · Elad Hazan (Princeton University and Google AI Princeton, ehazan@cs.princeton.edu) · Karan Singh (Carnegie Mellon University, karansingh@cmu.edu)

**Abstract.** Reducing reinforcement learning to supervised learning is a well-studied and effective approach that leverages the benefits of compact function approximation to deal with large-scale Markov decision processes. Independently, the boosting methodology (e.g. AdaBoost) has proven indispensable in designing efficient and accurate classification algorithms by combining inaccurate rules-of-thumb. In this paper, we take a further step: we reduce reinforcement learning to a sequence of weak learning problems. Since weak learners perform only marginally better than random guesses, such subroutines constitute a weaker assumption than the availability of an accurate supervised learning oracle. We prove that the sample complexity and running time bounds of the proposed method do not explicitly depend on the number of states. While existing results on boosting operate on convex losses, the value function over policies is non-convex. We show how to use a non-convex variant of the Frank-Wolfe method for boosting, which additionally improves upon the known sample complexity and running time even for reductions to supervised learning.

36th Conference on Neural Information Processing Systems (NeurIPS 2022).

## 1 Introduction

In reinforcement learning, Markov decision processes (MDPs) model the mechanism of learning from rewards, as opposed to examples. Although the case of tabular MDPs is well understood, the main challenge in applying RL in the real world is the size of the state space in practical domains. This challenge of finding efficient and provable algorithms for MDPs with large state spaces is the focus of our study.

Various techniques have been suggested and applied to cope with very large MDPs. One class of approaches attempts to approximate either the value or the transition function of the underlying MDP by using a parametric function class. Such approaches invariably make strong realizability assumptions to produce global optimality guarantees. Another class of approaches, so-called direct methods, produces a near-optimal policy that maximizes the expected return over a given policy class. To deal with the challenge of large (possibly innumerable) policy classes, a popular strategy [24] is to frame policy search as a sequence of supervised learning problems. Such approaches yield global optimality guarantees under state coverage assumptions without reliance on realizability, and have inspired practical adaptations for sampling-based policy search.

In this paper, we study another methodology to derive provable algorithms for reinforcement learning: ensemble methods for aggregating weak or approximate algorithms into substantially more accurate solutions. Our proposal extends the methodology of boosting, typically used to solve supervised learning instances [32], to reinforcement learning.

| | Supervised weak learner | Online weak learner |
|---|---|---|
| Episodic model | $1/(\gamma_0^{4}\varepsilon^{5})$ | $1/(\gamma_0^{2}\varepsilon^{3})$ |
| Rollouts w. $\nu$-resets | $1/(\gamma_0^{4}\varepsilon^{6})$ | $1/(\gamma_0^{2}\varepsilon^{4})$ |

Table 1: Sample complexity of the proposed algorithms for the different $\gamma_0$-weak learning models (supervised and online) and modes of accessing the MDP (episodic rollouts, and rollouts with a reset distribution $\nu$), in terms of $\gamma_0$ and $\varepsilon$, suppressing other terms. This work is the first to introduce a reduction of RL to weak supervised learning; see Theorem 7 for details.
| Supervised strong learner | This work (Corollary 8) | CPI [24] |
|---|---|---|
| Episodic model | $1/\varepsilon^{3}$ | $1/\varepsilon^{4}$ |
| Rollouts w. $\nu$-resets | $1/\varepsilon^{4}$ | $1/\varepsilon^{4}$ |

Table 2: Sample complexity of the proposed algorithm with a strong ($\gamma_0 = 1$) supervised learning model under different modes of accessing the MDP, compared to previous work [24].

A typical boosting algorithm (e.g. AdaBoost) iteratively constructs a near-optimal classifier by combining computationally cheap, yet inaccurate, rules-of-thumb. Unlike RL reductions to supervised learning, which assume the existence of an efficient and accurate classification or regression procedure, the proposed algorithm builds on learning algorithms that perform only ever-so-slightly better than a random guess, and which may thus be obtained cheaply in both computational and statistical terms.

Concretely, we assume access to a weak learner: an efficient sample-based procedure that is capable of generating an approximate solution to any weighted multi-class objective over a fixed policy class. We describe an algorithm that iteratively calls this procedure on carefully constructed new objectives, and aggregates the solutions into a single policy. We prove that after sufficiently many iterations, the resulting policy has competitive global guarantees on performance. Interestingly, unlike boosting algorithms for regression and classification, our resulting aggregation of weak learners is non-linear.

### 1.1 Challenges and techniques

Reinforcement learning is quite different from supervised learning, and several difficulties have to be circumvented for boosting to work. Among the challenges that the reinforcement learning setting presents, consider the following:

(a) The value function is not a convex or concave function of the policy. This is true even in the tabular case, and even more so if we use a parameterized policy class.

(b) The transition matrix is unknown, or prohibitively large to manipulate for large state spaces. This means that even evaluation of a policy cannot be exact, and can only be computed approximately.

(c) It is unrealistic to expect a weak learner that attains near-optimal value for a given linear objective over the policy class. At most one can hope for a multiplicative and/or additive approximation of the overall value.

Our approach overcomes these challenges by applying several new as well as recently developed techniques. To overcome the non-convexity of the value function, we use a novel variant of the Frank-Wolfe optimization algorithm that simultaneously delivers two guarantees. First, it finds a first-order stationary point at a near-optimal rate. Second, if the objective happens to admit a certain gradient domination property, an important generalization of convexity, it also guarantees near-optimal value. The application of the non-convex Frank-Wolfe method is justified by recent investigations of the policy gradient algorithm [2, 1], which identified conditions under which the value function is gradient dominated.

The second, information-theoretic, challenge of the unknown transition function is overcome by careful algorithmic design: our boosting algorithm requires only samples of the transitions and rewards, obtained by rollouts on the MDP. The third challenge is perhaps the most difficult to overcome. Thus far, the use of the Frank-Wolfe method in reinforcement learning did not accommodate a multiplicative approximation, which is critical for our application.
We adapt the techniques used for boosting in online convex optimization [19] with a multiplicative weak learner to our setting, by non-linearly aggregating the weak learners (using a 2-layer network). This aspect is perhaps of general interest to boosting algorithm design, which is mostly based on linear aggregation.

### 1.2 Our contributions

Our main contribution is a novel efficient boosting algorithm for reinforcement learning. Our techniques apply in various settings, and the sample complexity bounds of all of our results are summarized in Tables 1 and 2. The input to this algorithm is a weak learning method capable of approximately solving a weighted multi-class problem instance over a certain policy class. The output of the algorithm is a policy which does not belong to the original class considered, hence this is an instance of improper learning. It is rather a non-linear aggregation of policies from the original class, according to a two-layer neural network. This is a result of the two-tier structure of our algorithm: an outer loop of a non-convex Frank-Wolfe method, and an inner loop of online-convex-optimization-based boosting. The final policy comes with provable global optimality guarantees.

Beyond the novelty of techniques, an important contribution (Table 1) of our work is to highlight the quantitative difference in guarantees that depends on the mode of accessing the MDP (episodic rollouts vs. access to an exploratory reset distribution) and the nature of the weak learners (online vs. statistical), thus indicating that some algorithmic choices may be preferable to others in terms of speed of convergence and sample complexity. As with existing reductions to supervised learning [24], these global convergence guarantees hold under appropriate state coverage assumptions, either via access to a reset distribution that has some overlap with the state distribution of the optimal policy, or by constraining the policy class to policies that explore sufficiently. Yet another contribution of our work is to show an improved sample complexity result in the latter setting, even when considering reductions to supervised learning instances. This improvement in convergence in well-studied settings is documented in Table 2.

### 1.3 Related work

Reinforcement learning approaches for dealing with large-scale MDPs rely on function approximation [35]. Such function approximation may be performed on the underlying conditional probability of transition (e.g. [34, 21]) or on the value function (e.g. [37, 36]). The provable guarantees in such methods come at the cost of strong realizability assumptions. In contrast, the so-called direct approaches attempt policy search over an appropriate policy class [2, 1], and rely on making incremental updates, such as variants of Conservative Policy Iteration (CPI) [24, 33, 4] and Policy Search by Dynamic Programming (PSDP) [6]. These provide convergence guarantees under appropriate state coverage assumptions, comparable to the ones made in this work.

Our boosting approach for provable RL builds on the vast literature of boosting for supervised learning [32] and, more recently, online learning [27, 12, 13, 7, 22, 23]. One of the techniques crucial to our application is the extension of boosting to the online convex optimization setting, with bandit information [10], and critically with a multiplicative weak learner [19]. This latter technique implies a non-linear aggregation of the weak learners.
Non-linear boosting was only recently investigated in the context of classification [5], where it was shown to potentially enable significantly more efficient boosting. Another related work studies boosting in the context of control of dynamical systems [3]. However, that work critically requires knowledge of the underlying dynamics (transitions) and makes convexity assumptions, which we do not, and it cannot cope with a multiplicatively approximate weak learner. The Frank-Wolfe algorithm is extensively used in machine learning; see e.g. [20] and references therein, as well as recent progress on stochastic Frank-Wolfe methods [16, 28, 11, 39]. Recent literature has applied a variant of this algorithm to reinforcement learning in the context of state space exploration [18].

## 2 Preliminaries

**Optimization.** We say that a differentiable function $f : \mathcal K \to \mathbb R$ over some domain $\mathcal K \subseteq \mathbb R^d$ is $L$-smooth with respect to some norm $\|\cdot\|$ if for every $x, y \in \mathcal K$ we have
$$\big|\,f(y) - f(x) - \nabla f(x)^\top (y - x)\,\big| \;\le\; \tfrac{L}{2}\,\|y - x\|^2.$$
We define the projection $\Gamma : \mathbb R^{|A|} \to \Delta_A$ with respect to a set $A$, where $\Delta_A$ denotes the probability simplex over $A$: for any $x \in \mathbb R^{|A|}$, $\Gamma[x] = \arg\min_{y \in \Delta_A} \|x - y\|$. An important generalization of the property of convexity that we use henceforth is gradient domination.

**Definition 1 (Gradient Domination).** A function $f : \mathcal K \to \mathbb R$ is said to be $(c_1, c_2, \mathcal K_1, \mathcal K_2)$-locally gradient dominated (around $\mathcal K_1$ by $\mathcal K_2$) if for all $x \in \mathcal K_1$ it holds that
$$\max_{y \in \mathcal K} f(y) - f(x) \;\le\; c_1 \max_{y \in \mathcal K_2} \nabla f(x)^\top (y - x) + c_2.$$

**Markov decision process.** An infinite-horizon discounted Markov Decision Process (MDP) $M = (S, A, P, r, \gamma, d_0)$ is specified by: a state space $S$; an action space $A$; a transition model $P$, where $P(s'|s,a)$ denotes the probability of immediately transitioning to state $s'$ upon taking action $a$ at state $s$; a reward function $r : S \times A \to [0,1]$, where $r(s,a)$ is the immediate reward associated with taking action $a$ at state $s$; a discount factor $\gamma \in [0,1)$; and a starting state distribution $d_0$ over $S$. To any infinite-length state-action sequence (hereafter, a trajectory) we assign the value
$$V\big(\tau = (s_0, a_0, s_1, a_1, \dots)\big) = \sum_{t=0}^{\infty} \gamma^t\, r(s_t, a_t).$$
The agent interacts with the MDP through the choice of a stochastic policy $\pi : S \to \Delta_A$ it executes. The execution of such a policy induces a distribution over trajectories $\tau = (s_0, a_0, \dots)$:
$$\mathbb P(\tau \mid \pi) = d_0(s_0) \prod_{t=0}^{\infty} P(s_{t+1} \mid s_t, a_t)\, \pi(a_t \mid s_t).$$
Using this description, we can associate a state value function $V^\pi(s)$ and a state-action value function $Q^\pi(s,a)$ with any policy $\pi$. For an arbitrary distribution $d$ over $S$, define
$$Q^\pi(s,a) = \mathbb E\Big[\textstyle\sum_{t=0}^{\infty} \gamma^t r(s_t, a_t) \,\Big|\, \pi,\ s_0 = s,\ a_0 = a\Big], \qquad V^\pi(s) = \mathbb E_{a \sim \pi(\cdot|s)}\big[Q^\pi(s,a)\big], \qquad V^\pi_d = \mathbb E_{s_0 \sim d}\big[V^\pi(s_0)\big].$$
Here the expectation is with respect to the randomness of the trajectory induced by $\pi$ in $M$. When convenient, we use $V^\pi$ to denote $V^\pi_{d_0}$, and $V^\star$ to denote $\max_\pi V^\pi$. Similarly, to any policy $\pi$ one may ascribe a (discounted) state-visitation distribution $d^\pi_d$, given by
$$d^\pi_d(s) = (1-\gamma) \sum_{t=0}^{\infty} \gamma^t\, \mathbb P(s_t = s \mid \pi,\ s_0 \sim d).$$
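Since the transition model is unknown, quantities such as $V^\pi_{d_0}$ can only be estimated from sampled rollouts (this is also how Algorithm 1 below evaluates its iterates). Below is a minimal Monte-Carlo sketch of such an estimate; the `env_reset`/`env_step` interaction interface and the `policy` callable returning action probabilities are illustrative assumptions, not part of the paper.

```python
import numpy as np

def estimate_value(env_reset, env_step, policy, gamma, n_episodes, horizon, rng):
    """Monte-Carlo estimate of V^pi_{d0}: average discounted return over sampled
    rollouts, truncated at a finite horizon. Interface names are hypothetical."""
    returns = []
    for _ in range(n_episodes):
        state = env_reset()                      # s0 ~ d0
        total, discount = 0.0, 1.0
        for _ in range(horizon):
            probs = policy(state)                # distribution over actions at `state`
            action = rng.choice(len(probs), p=probs)
            state, reward = env_step(action)
            total += discount * reward
            discount *= gamma
        returns.append(total)
    return float(np.mean(returns))
```

With rewards in $[0,1]$, truncating at horizon $H$ biases the estimate by at most $\gamma^H/(1-\gamma)$, so an effective horizon of order $1/(1-\gamma)$ suffices for the accuracy levels considered here.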
**Modes of accessing the MDP.** We henceforth consider two modes of accessing the MDP, both standard in the reinforcement learning literature, and provide different results for each. The first, natural access model is the episodic rollout setting. This mode of interaction allows us to execute a policy, stop and restart at any point, and do this multiple times. The other interaction model we consider is rollout with $\nu$-restarts. This is similar to the episodic setting, but here the agent may draw from the MDP a trajectory seeded with an initial state distribution $\nu \neq d_0$. This interaction model was considered in prior work on policy optimization [24, 2]. The motivation for this model is two-fold: first, $\nu$ can be used to incorporate priors (or domain knowledge) about the state coverage of the optimal policy; second, $\nu$ provides a mechanism to incorporate exploration into policy optimization procedures.

### 2.1 Weak learning

Our boosting algorithms henceforth call upon weak learners to generate weak policies. We formalize the notion of a weak learner next. We consider two types of weak learners, and give different end results based on the corresponding assumptions: weak supervised and weak online learners. In the discussion below, let $\pi^{\mathrm{Rand}}$ be the uniformly random policy, i.e. $\forall (s,a) \in S \times A$, $\pi^{\mathrm{Rand}}(a|s) = 1/|A|$. The formal definition and results for the online setting are deferred to the appendix. In what follows we define the supervised weak learning model.

The natural way to define weak learning would be as an algorithm whose performance is always slightly better than that of the random policy, which chooses an action uniformly at random at any given state. However, in general no learner can outperform a random learner over all label distributions. This motivates the literature on agnostic boosting [25, 9, 19], which defines a weak learner as one that can approximate the best policy in a given policy class.

**Definition 2 (Weak Supervised Learner).** Let $\gamma_0 \in (0, 1]$. Consider a class $\mathcal L$ of linear loss functions $\ell : \mathbb R^{|A|} \to \mathbb R$, a family $\mathcal D$ of distributions supported over $S \times \mathcal L$, and a policy class $\Pi$. A weak supervised learning algorithm $\mathcal W$, for every $\varepsilon, \delta > 0$, given $m(\varepsilon, \delta)$ samples $D_m$ from any distribution $D \in \mathcal D$, where $m(\varepsilon,\delta)$ scales with $\log(|\mathcal W|/\delta)$, outputs a policy $\pi = \mathcal W(D_m) \in \Pi$ such that, with probability at least $1-\delta$,
$$\mathbb E_{(s,\ell) \sim D}\big[\ell\big(\pi(\cdot|s)\big)\big] \;\ge\; \gamma_0 \max_{\pi^\star \in \Pi} \mathbb E_{(s,\ell) \sim D}\big[\ell\big(\pi^\star(\cdot|s)\big)\big] \;+\; (1-\gamma_0)\, \mathbb E_{(s,\ell) \sim D}\big[\ell\big(\pi^{\mathrm{Rand}}(\cdot|s)\big)\big] \;-\; \varepsilon.$$

Note that the weak learner outputs a policy in $\Pi$ which is only approximately competitive against the class $\Pi$. As an additional relaxation, instead of requiring that the weak learning guarantee hold for all distributions, in our setup it suffices that the weak learning assumption hold over natural distributions. Specifically, we define a class of natural distributions $\mathcal D$ such that $D \in \mathcal D$ if and only if there exists some policy $\pi$ for which the state marginal satisfies
$$D(s) = \int D(s, \ell)\, d\mu(\ell) = d^\pi(s).$$
In particular, while a natural distribution may have an arbitrary distribution over labels, its marginal distribution over states must be realizable as the state distribution of some policy executed in the MDP $M$. Therefore, the complexity of weak learning adapts to the complexity of the MDP itself. As an extreme example, in stochastic contextual bandits, where policies do not affect the distribution of states (say $d_0$), it is sufficient that the weak learning condition hold with respect to all couplings of the single distribution $d_0$.
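To fix ideas, the boosting procedures of Section 3 interact with the weak learner only through an interface of roughly the following shape. This is an illustrative sketch of the assumed oracle; the type names `Policy` and `WeakSupervisedLearner` are ours, not the paper's.

```python
from typing import Callable, Protocol, Sequence, Tuple
import numpy as np

# A stochastic policy maps a state (feature vector) to a distribution over actions.
Policy = Callable[[np.ndarray], np.ndarray]

class WeakSupervisedLearner(Protocol):
    """Interface assumed of the weak learner W in Definition 2: given samples
    (s_i, ell_i), where ell_i is the coefficient vector of a linear loss over
    action distributions, return a base-class policy whose expected loss is
    gamma_0-competitive with the best base policy, up to the (1 - gamma_0)-weighted
    random-policy baseline. All names here are illustrative."""

    def __call__(self, samples: Sequence[Tuple[np.ndarray, np.ndarray]]) -> Policy:
        ...
```

Any procedure satisfying Definition 2 over natural distributions can be plugged into this interface; the experiments of Section 5 use shallow decision trees in this role.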
## 3 Algorithm & Main Results

In this section we describe our RL boosting algorithm. Here we focus on the case where a supervised weak learner is provided. The online weak learner variant of our result is detailed in the appendix. We next give several definitions and algorithmic subroutines required for our method.

### 3.1 Policy aggregation

For a base class of policies $\Pi$, our algorithm incrementally builds a more expressive policy class by aggregating base policies via both linear combinations and non-linear transformations. In effect, the algorithm produces a finite-width depth-2 circuit over some subset of the base policy class. That is, our approach can be thought of as an aggregation of base policies which forms a 2-layer neural network, as depicted in Figure 1. The leaves of the tree are the policies $\pi \in \Pi$ of the base policy class. These are then linearly aggregated to form the first layer of the tree (the intermediate nodes in Figure 1). Next, each linear combination of policies in the overall aggregation undergoes a projection operation. The projection may be viewed as a non-linear activation function, such as a ReLU, in deep learning terms. Note that the projection of any function from $S$ to $\mathbb R^{|A|}$ produces a policy, i.e. a mapping from states to distributions over actions. In the analysis of our algorithm we give a particular projection operation $\Gamma[\cdot]$ which allows us to obtain the desired guarantees.

**Definition 3 (Policy Projection).** Given $h : S \to \mathbb R^{|A|}$, define the projected policy $\pi = \Gamma[h]$ to be the policy such that, simultaneously for all $s \in S$, it holds that $\pi(\cdot|s) = \Gamma[h(s)]$.

**Definition 4 (Policy Tree).** A Policy Tree $\bar\pi : S \to \Delta_A$, with respect to a base policy class $\Pi$ of policies $S \to \Delta_A$ and $N, T \in \mathbb N$, is a linear combination of $T$ projected policies $\Gamma[h_j]$, where each $h_j$ is a linear combination of $N$ base policies $\pi \in \Pi$.

This final definition describes the set of possible outputs of the boosting procedure. It is important that the policy output by the boosting algorithm can be evaluated efficiently; in the appendix we show that this is indeed the case (see Lemma 12). Hereafter, we denote by $\bar\Pi$ the class of Policy Trees with respect to $\Pi$, $N$ and $T$, for $N, T = O\big(\mathrm{poly}(|A|, (1-\gamma)^{-1}, \varepsilon^{-1}, \gamma_0^{-1}, \log \delta^{-1})\big)$ specified later.

### 3.2 Main results

Next, we give the main results of our RL boosting algorithm via weak supervised learning, specified in Algorithm 1.

Figure 1: A Policy Tree. The figure illustrates the Policy Tree hierarchy (see Definition 4), the output of the boosting procedure specified in Algorithm 1. Specifically, it is obtained by setting $N = 3$ in the inner loop of Internal Boost (Algorithm 2) and $T = 2$ in the main booster (Algorithm 1). Overall, the base policies $\pi_1, \dots, \pi_6 \in \Pi$ at the lower level form the Policy Tree $\bar\pi \in \bar\Pi$.

**Algorithm 1 (RL Boosting)**

1. Input: number of iterations $T$, initial state distribution $\mu$, and parameters $P, N, M$ for Internal Boost.
2. Initialize a policy $\pi_0 \in \Pi$ arbitrarily.
3. for $t = 1$ to $T$ do
4. Run Internal Boost (Algorithm 2) with distribution $\mu$ and policy $\pi_{t-1}$ to obtain $\pi'_t$.
5. Update $\pi_t = (1 - \eta_{1,t})\, \pi_{t-1} + \eta_{1,t}\, \pi'_t$.
6. end for
7. Run each policy $\pi_t$ for $P$ rollouts to compute an empirical estimate $\hat V^{\pi_t}$ of its expected return.
8. return $\bar\pi := \pi_{t_0}$, where $t_0 = \arg\max_t \hat V^{\pi_t}$.
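In code, the outer loop of Algorithm 1 is a short procedure. The sketch below treats Internal Boost and the rollout-based value estimate as assumed black boxes (`internal_boost`, `estimate_value`), and represents a policy as a function from states to action-probability vectors.

```python
import numpy as np

def mix(pi_prev, pi_new, eta):
    """Pointwise convex combination of two stochastic policies, each mapping a
    state to a probability vector over actions."""
    return lambda s: (1.0 - eta) * pi_prev(s) + eta * pi_new(s)

def rl_boosting(internal_boost, estimate_value, pi_0, T, eta_schedule):
    """Sketch of the outer loop of Algorithm 1 (RL Boosting). `internal_boost`
    stands in for Algorithm 2 and `estimate_value` for the P-rollout return
    estimate of line 7; both are assumed, not prescribed, interfaces."""
    pi_t = pi_0
    iterates = []
    for t in range(1, T + 1):
        pi_prime = internal_boost(pi_t)                 # line 4: weak-learner inner loop
        pi_t = mix(pi_t, pi_prime, eta_schedule(t))     # line 5: (1 - eta) * pi + eta * pi'
        iterates.append(pi_t)
    values = [estimate_value(pi) for pi in iterates]    # line 7: empirical returns
    return iterates[int(np.argmax(values))]             # line 8: best iterate
```

The pointwise convex combination in `mix` corresponds to the update $\pi_t = (1-\eta_{1,t})\pi_{t-1} + \eta_{1,t}\pi'_t$ of line 5, which keeps each $\pi_t(\cdot|s)$ a valid distribution over actions.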
To state the results, we need the following definitions. The first generalizes the policy completeness notion from [33]. It may be seen as the policy-equivalent analogue of the inherent Bellman error [29]. Intuitively, it measures the degree to which a policy in $\Pi$ can best approximate the Bellman operator, in an average sense, with respect to the state distribution induced by a policy from $\bar\Pi$.

**Definition 5 (Policy Completeness).** For any initial state distribution $\mu$ and policy classes $\Pi, \bar\Pi$, define
$$\mathcal E_\mu \;=\; \max_{\pi \in \bar\Pi}\ \min_{\pi' \in \Pi}\ \mathbb E_{s \sim d^\pi_\mu}\Big[\max_{a \in A} Q^\pi(s,a) \;-\; Q^\pi(s,\cdot)^\top \pi'(\cdot|s)\Big].$$

**Definition 6 (Distribution Mismatch).** Let $\pi^\star = \arg\max_\pi V^\pi$, and let $\nu$ be a fixed initial state distribution (see Section 2). Define the distribution mismatch coefficients
$$C_1 = \max_{\pi \in \bar\Pi}\ \Big\|\, d^{\pi^\star}_{d_0} \big/ d^{\pi}_{d_0} \Big\|_\infty, \qquad D_1 = \Big\|\, d^{\pi^\star}_{d_0} \big/ \nu \Big\|_\infty.$$

The above notion of distribution mismatch coefficient is often useful to characterize the exploration problem faced by policy optimization algorithms. We now give the main result for the output of our RL boosting algorithm, assuming supervised weak learners.

**Theorem 7.** Algorithm 1 samples $T(MN + P)$ episodes, each of length $O\big(\tfrac{1}{1-\gamma}\big)$, with probability $1-\delta$. In the episodic model, with $\mu = d_0$ and step sizes $\eta_{1,t}$ and parameters $T, N, M$ chosen as polynomials in $C_1$, $|A|$, $(1-\gamma)^{-1}$, $\gamma_0^{-1}$, $\varepsilon^{-1}$ and $\log \delta^{-1}$, with probability $1-\delta$,
$$V^\star - V^{\bar\pi} \;\le\; \frac{C_1\, \mathcal E_{d_0}}{1-\gamma} + \varepsilon.$$
In the $\nu$-reset model, with $\mu = \nu$ and analogous parameter settings, with probability $1-\delta$,
$$V^\star - V^{\bar\pi} \;\le\; \frac{D_1\, \mathcal E_{\nu}}{(1-\gamma)^2} + \varepsilon.$$
Sample complexities: if the weak learner's sample complexity $m(\varepsilon, \delta)$ scales with $\log(|\mathcal W|/\delta)$ for some measure $|\mathcal W|$ of weak-learning complexity, the algorithm samples $O\Big(\tfrac{|A|^4 \log|\mathcal W|}{(1-\gamma)^{11}\, \gamma_0^4\, \varepsilon^5}\Big)$ episodes in the episodic model, and $O\Big(\tfrac{|A|^4 \log|\mathcal W|}{(1-\gamma)^{18}\, \gamma_0^4\, \varepsilon^6}\Big)$ in the $\nu$-reset model.

**Algorithm 2 (Internal Boost)**

1. Input: number of iterations $N$, number of episodes $M$, initial policy $\pi$, initial state distribution $\mu$.
2. Set $\pi_0$ to be an arbitrary policy in $\Pi$.
3. for $n = 1$ to $N$ do
4. Execute $\pi$ with $\mu$ via Algorithm 3 for $M$ episodes, to get $D_n = \{(s_i, \hat Q_i)\}_{i=1}^M$.
5. Modify $D_n$ to produce a new dataset $D'_n = \{(s_i, f_i)\}_{i=1}^M$ such that, for all $i \in [M]$, $f_i = \beta\big(y_i - \pi_{n-1}(\cdot|s_i)\big)$, where
$$y_i = \arg\min_{y \in \mathbb R^{|A|}} \Big\{ \hat Q_i^\top y + G \min_{z \in \Delta_A} \|z - y\| + \beta\, \big\|\pi_{n-1}(\cdot|s_i) - y\big\|^2 \Big\},$$
with $G = \tfrac{|A|}{1-\gamma}$, $\beta = \tfrac{2\gamma}{(1-\gamma)^3}$, and $f_i, \hat Q_i \in \mathbb R^{|A|}$.
6. Let $\pi^{\mathcal W}_n$ be the policy chosen by the weak learning oracle when given the dataset $D'_n$.
7. Update $\pi_n = (1 - \eta_{2,n})\, \pi_{n-1} + \eta_{2,n}\, \pi^{\mathcal W}_n$.
8. end for
9. return $\Gamma[\pi_N]$.

**Algorithm 3 (Trajectory Sampler):** samples a state $s \sim d^\pi_\mu$ and an unbiased estimate of $Q^\pi(s, \cdot)$.

1. Sample a state $s_0 \sim \mu$ and an action $a_0 \sim U(A)$ uniformly at random.
2. Sample $s \sim d^\pi_\mu$ as follows: at every timestep $h$, with probability $\gamma$ act according to $\pi$; otherwise, accept $s_h$ as the sample and proceed to Step 3.
3. Take action $a_0$ at state $s_h$, then continue to execute $\pi$ with a termination probability of $1-\gamma$ per step. Upon termination, set $R(s_h, a_0)$ to the undiscounted sum of rewards from time $h$ onwards.
4. Define the vector $\hat Q^\pi_{s_h}$ such that, for all $a \in A$, $\hat Q^\pi_{s_h}(a) = |A|\, R(s_h, a_0)\, \mathbb I\{a = a_0\}$.
5. return $(s_h, \hat Q^\pi_{s_h})$.

Theorem 7 above pertains to the case where a weak learning algorithm is available. However, another main result is obtained by considering the simpler approach of reducing RL to a strong supervised learning algorithm. In particular, when running our main boosting algorithm, we can replace the call to Internal Boost (in line 4 of Algorithm 1) with a call to a strong supervised learning algorithm. By an analysis similar to that of Theorem 7 we obtain the following corollary.

**Corollary 8.** Let the weak learner's sample complexity $m(\varepsilon, \delta)$ scale with $\log(|\mathcal W|/\delta)$ for some measure $|\mathcal W|$ of weak-learning complexity. When run with a strong supervised learning oracle (Definition 2 with $\gamma_0 = 1$, i.e. $N = 1$) in place of Internal Boost, Algorithm 1 samples $O(1/\varepsilon^3)$ episodes in the episodic model and $O(1/\varepsilon^4)$ episodes in the $\nu$-reset model (suppressing other factors; see Table 2), and guarantees with probability $1-\delta$ that $V^\star - V^{\bar\pi} \le \tfrac{C_1 \mathcal E_{d_0}}{1-\gamma} + \varepsilon$ in the episodic model, and $V^\star - V^{\bar\pi} \le \tfrac{D_1 \mathcal E_{\nu}}{(1-\gamma)^2} + \varepsilon$ in the $\nu$-reset model.

We note that this result improves over previous results in terms of the sample complexity of the algorithm. In particular, in [24], Theorem 4.4 and Corollary 4.5 achieve the same guarantee using $O(1/\varepsilon^4)$ samples regardless of the MDP access model. Briefly, CPI uses $1/\varepsilon^2$ calls to an $\varepsilon$-optimal supervised learning oracle (each call needing $1/\varepsilon^2$ samples) to reach an $\varepsilon$-approximate local optimum of the value function. Under the requisite state coverage assumptions, this translates to $\varepsilon$-suboptimality in function value. Indeed, such a mode of analysis, which first argues for convergence to a local optimum, can be shown to be tight for the CPI algorithm. The improvement in our case for the episodic access model comes from the insight that it is possible to make direct claims on the function-value suboptimality (second part of Theorem 9), bypassing the need to first establish local optimality, in the gradient-dominated case.

### 3.3 Trajectory sampler

In Algorithm 3 we describe the episodic sampling procedure used in the sample-based RL boosting algorithms above. For a fixed initial state distribution $\mu$ and any given policy $\pi$, we apply the following sampling procedure: start at an initial state $s_0 \sim \mu$, and continue to act in the MDP according to the policy $\pi$ until termination. With this process, it is straightforward both to sample from the state-visitation distribution $s \sim d^\pi_\mu$ and to obtain unbiased samples of $Q^\pi(s, \cdot)$; see Algorithm 3 for the detailed process.
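A minimal sketch of Algorithm 3's sampling logic is given below. It assumes a hypothetical environment object with `reset_to(state)` and `step(action) -> (next_state, reward)` methods, plus a sampler for the initial distribution $\mu$; these interface names are ours, not the paper's.

```python
import numpy as np

def trajectory_sampler(env, policy, sample_mu, n_actions, gamma, rng):
    """Sketch of Algorithm 3: returns (s_h, q_hat), where s_h ~ d^pi_mu and
    q_hat is an unbiased one-hot estimate of Q^pi(s_h, .)."""
    # Steps 1-2: draw s0 ~ mu and a0 uniformly; follow the policy, accepting the
    # current state with probability 1 - gamma at each step.
    state = sample_mu(rng)
    env.reset_to(state)
    a0 = int(rng.integers(n_actions))
    while rng.random() < gamma:
        action = rng.choice(n_actions, p=policy(state))
        state, _ = env.step(action)
    s_h = state
    # Step 3: play a0 at s_h, then roll out the policy with termination
    # probability 1 - gamma, summing the undiscounted rewards from time h on.
    state, reward = env.step(a0)
    total_reward = reward
    while rng.random() < gamma:
        action = rng.choice(n_actions, p=policy(state))
        state, reward = env.step(action)
        total_reward += reward
    # Step 4: importance-weighted one-hot estimate of Q^pi(s_h, .).
    q_hat = np.zeros(n_actions)
    q_hat[a0] = n_actions * total_reward
    return s_h, q_hat
```

The geometric stopping with continuation probability $\gamma$ is what makes the accepted state distributed as $d^\pi_\mu$, and the vector $|A| \cdot R \cdot \mathbb I\{a = a_0\}$, with $a_0$ uniform, is an unbiased estimate of $Q^\pi(s_h,\cdot)$, as used in Lemma 11 below.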
## 4 Sketch of the analysis

**Non-convex Frank-Wolfe.** We give an abstract, high-level procedural template in which the previously introduced RL boosters operate. It is based on a variant of the Frank-Wolfe optimization technique [15], adapted to non-convex and gradient-dominated function classes (see Definition 1). The Frank-Wolfe (FW) method assumes oracle access to a black-box linear optimizer, denoted $\mathcal O$, and uses it by iteratively making oracle calls with modified objectives in order to solve the harder task of convex optimization. Analogously, boosting algorithms often assume oracle access to a weak learner, which is used by iteratively making oracle calls with modified objectives in order to obtain a strong learner with boosted performance. In the RL setting, the objective is in fact non-convex, but exhibits gradient domination. By adapting the Frank-Wolfe technique to this setting, we obtain the guarantees for the algorithms given in Section 3.

**Oracle.** Denote by $\mathcal O$ a black-box oracle implementing a $(\gamma_0, \varepsilon_0, \mathcal K_2)$-approximate linear optimizer over a convex set $\mathcal K \subseteq \mathbb R^d$: for any given $v \in \mathbb R^d$,
$$v^\top \mathcal O(v) \;\ge\; \gamma_0 \max_{u \in \mathcal K_2} v^\top u \;-\; \varepsilon_0.$$

**Algorithm 4 (Non-convex Frank-Wolfe)**

1. Input: $T > 0$, objective $f$, linear optimizer $\mathcal O$, step sizes $\eta_t$.
2. Choose $x_0 \in \mathcal K$ arbitrarily.
3. for $t = 1, \dots, T$ do
4. Call $z_t = \mathcal O(\nabla_{t-1})$, where $\nabla_{t-1} = \nabla f(x_{t-1})$. Set $x_t = (1 - \eta_t)\, x_{t-1} + \eta_t\, z_t$.
5. end for
6. return $\bar x := x_{t_0}$, where $t_0 = \arg\min_t \nabla_{t-1}^\top (z_t - x_{t-1})$.

**Theorem 9.** Let $f : \mathcal K \to \mathbb R$ be $L$-smooth in some norm $\|\cdot\|$ and bounded as $|f(x)| \le H$ for all $x \in \mathcal K$, for some $H > 0$, and let $D$ be the diameter of $\mathcal K$ in $\|\cdot\|$. Then, for a $(\gamma_0, \varepsilon_0, \mathcal K_2)$-approximate linear optimization oracle $\mathcal O$ and fixed step sizes $\eta_t = \eta = \sqrt{4H/(LD^2 T)}$, the output $\bar x$ of Algorithm 4 satisfies, up to the oracle's $(\gamma_0, \varepsilon_0)$ approximation error,
$$\max_{u \in \mathcal K_2} \nabla f(\bar x)^\top (u - \bar x) \;\le\; 2\sqrt{\frac{2 \max\{LD^2, H\}}{T}}.$$
Furthermore, if $f$ is $(c_1, c_2, \mathcal K_1, \mathcal K_2)$-locally gradient dominated and $x_0, \dots, x_T \in \mathcal K_1$, then with step sizes $\eta_t = \min\{1, 2/t\}$ the output $\bar x$ of Algorithm 4 satisfies the corresponding bound on the value suboptimality $\max_{x^\star \in \mathcal K} f(x^\star) - f(\bar x)$.
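In code, the template of Algorithm 4 is simply the following; `grad`, `lin_opt` and the step-size schedule `eta` are assumed callables standing in for $\nabla f$, the oracle $\mathcal O$, and $\eta_t$.

```python
import numpy as np

def nonconvex_frank_wolfe(grad, lin_opt, x0, T, eta):
    """Sketch of Algorithm 4 over a convex set represented implicitly by lin_opt."""
    x = x0
    iterates, gaps = [], []
    for t in range(1, T + 1):
        g = grad(x)
        z = lin_opt(g)                     # z_t = O(grad f(x_{t-1}))
        gaps.append(float(g @ (z - x)))    # Frank-Wolfe gap at x_{t-1}
        x = (1.0 - eta(t)) * x + eta(t) * z
        iterates.append(x)
    # Line 6: return the iterate with the smallest observed FW gap.
    return iterates[int(np.argmin(gaps))]
```

In the RL instantiation, `lin_opt` is implemented by Internal Boost via the weak learner (Claim 10 below), and `grad` is only ever accessed through the sampled estimates of Lemma 11.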
We sketch the high-level ideas of the proof of our main result, stated in Theorem 7, and refer the reader to the appendix for the formal proof. We establish an equivalence between RL Boosting (Algorithm 1) and the above variant of the Frank-Wolfe algorithm (Algorithm 4). This abstraction allows us to obtain the novel convergence guarantees given in Theorem 7. Throughout the analysis, we write $\nabla V^\pi$ for the gradient of the value function with respect to the $|S| \times |A|$-sized tabular representation of the policy $\pi$, namely the functional gradient of $V^\pi$.

**Internal-boosting weak learners.** The Frank-Wolfe algorithm uses an inner gradient optimization oracle as a subroutine. To implement this oracle using only approximate optimizers, we employ yet another variant of the FW method as internal boosting for the weak learners, via an adapted analysis of [19], stated in Claim 10 below. Let $D_t$ be the distribution induced by the trajectory sampler in round $t$.

**Claim 10.** Let $\beta = 1/\sqrt N$ and $\eta_{2,n} = \min\{2/n, 1\}$. Then the policy $\pi'_t$ produced in round $t$ of Algorithm 1 (via Algorithm 2) is competitive, under $D_t$, with every policy in $\Pi$ with respect to the sampled $\hat Q$-losses: the gap $\max_{\pi \in \Pi} \mathbb E_{(s, \hat Q) \sim D_t}\big[\hat Q(s,\cdot)^\top\big(\pi(\cdot|s) - \pi'_t(\cdot|s)\big)\big]$ is bounded by a term of order $2|A| / \big((1-\gamma)\sqrt N\big)$, up to the weak learner's advantage factor and statistical error.

**From weak learning to linear optimization.** Next, we give an important observation which allows us to restate the guarantee of the previous subsection in terms of linear optimization over functional gradients. The key observation is that the expensive optimization procedure for $(\nabla V^\pi)^\top \pi'$, which in particular requires iterating over all states in $S$, can instead be replaced with sampling from an appropriate distribution $D^\pi$ (via Algorithm 3). These sample pairs $(s, \hat Q^\pi(s,\cdot))$ can then be fed to our weak learning algorithm, which guarantees generalization.

**Lemma 11.** Applying Algorithm 3 with any given policy $\pi$ yields an unbiased estimate of the gradient: for any policy $\pi'$,
$$\big(\nabla V^\pi_\mu\big)^\top \pi' \;=\; \frac{1}{1-\gamma}\, \mathbb E_{(s,\, \hat Q^\pi(s,\cdot)) \sim D^\pi}\Big[\hat Q^\pi(s,\cdot)^\top \pi'(\cdot|s)\Big],$$
where $\pi'(\cdot|s) \in \Delta_A$ and $D^\pi$ is the distribution induced on the outputs of Algorithm 3 for the policy $\pi$ and initial distribution $\mu$.

## 5 Experiments

The primary contribution of the present work is theoretical. Nevertheless, we empirically test our proposal with experiments designed to elicit qualitative properties of the proposed algorithm, rather than aiming for state-of-the-art performance. To validate our results, we check whether the proposed algorithm is indeed capable of boosting the accuracy of concrete instantiations of weak learners. We use depth-3 decision trees, with the implementation adapted from Scikit-learn [30], as our base weak learner. This choice of weak learner is particularly suitable for boosting, because it is an impoverished policy class in a representational sense, and hence it is reasonable to expect that it does only slightly better than random guessing with respect to the classification loss.

We compare the performance of the boosting algorithm (Algorithm 1), across multiple rounds of boosting or, equivalently, numbers of weak learners, to that of supervised-learning-based policy iteration; the computational burden of the algorithm scales linearly with the number of weak learners. Throughout all experiments we used $\gamma = 0.9$. To speed up computation, the plots were generated by retaining the 3 most recent policies of every iteration in the policy mixture. We evaluated these on the CartPole and LunarLander environments. The results demonstrate that the proposed RL boosting algorithm succeeds in maximizing rewards while using few weak learners (equivalently, within a few rounds of boosting).

Figure 2: Reward trajectories for the CartPole (left) and LunarLander (right) environments, for the proposed boosting algorithm with $N = 20, 50, 100$ base weak learners, compared to supervised-learning-based policy iteration (decision tree). The x-axis corresponds to the number of iterations $T$; for each $t \in [T]$, the reward is computed over 100 episodes of interaction. Confidence intervals are plotted over 3 such runs.
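For concreteness, a weak learner of the kind used in these experiments can be instantiated roughly as follows. The training interface below (states paired with derived action labels and per-sample weights) is an illustrative assumption rather than a description of the exact experimental pipeline.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

class TreeWeakLearner:
    """A depth-3 decision tree used as a weak policy, in the spirit of the
    experiments; the fit/predict wrapping here is an illustrative sketch."""

    def __init__(self, n_actions, max_depth=3):
        self.n_actions = n_actions
        self.tree = DecisionTreeClassifier(max_depth=max_depth)

    def fit(self, states, action_labels, sample_weights=None):
        # States are feature vectors; labels are target actions derived from the
        # estimated Q-values; weights encode the scaling of the induced losses.
        self.tree.fit(states, action_labels, sample_weight=sample_weights)
        return self

    def __call__(self, state):
        # Return a distribution over actions (one-hot on the predicted action).
        probs = np.zeros(self.n_actions)
        probs[int(self.tree.predict(state.reshape(1, -1))[0])] = 1.0
        return probs
```

A depth-3 tree is expressive enough to beat random guessing on the induced weighted classification problems, yet far too weak to represent a good policy on its own, which is the regime the boosting guarantees target.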
## 6 Conclusions

Building on recent advances in boosting for online convex optimization and bandits, we have described a boosting algorithm for reinforcement learning over large state spaces with provable guarantees. We see this as a first attempt at carrying a tried-and-tested methodology from supervised learning over to RL.

## References

[1] Alekh Agarwal, Mikael Henaff, Sham Kakade, and Wen Sun. PC-PG: Policy cover directed exploration for provable policy gradient learning. arXiv preprint arXiv:2007.08459, 2020.
[2] Alekh Agarwal, Sham M. Kakade, Jason D. Lee, and Gaurav Mahajan. On the theory of policy gradient methods: Optimality, approximation, and distribution shift. arXiv preprint arXiv:1908.00261, 2019.
[3] Naman Agarwal, Nataly Brukhim, Elad Hazan, and Zhou Lu. Boosting for control of dynamical systems. In International Conference on Machine Learning, pages 96–103. PMLR, 2020.
[4] Naman Agarwal, Brian Bullins, and Karan Singh. Variance-reduced conservative policy iteration. arXiv preprint arXiv:2212.06283, 2022.
[5] Noga Alon, Alon Gonen, Elad Hazan, and Shay Moran. Boosting simple learners. arXiv preprint arXiv:2001.11704, 2020.
[6] J. Andrew Bagnell, Sham Kakade, Andrew Y. Ng, and Jeff G. Schneider. Policy search by dynamic programming. In Advances in Neural Information Processing Systems, 2003.
[7] Alina Beygelzimer, Satyen Kale, and Haipeng Luo. Optimal and adaptive algorithms for online boosting. In International Conference on Machine Learning, pages 2323–2331, 2015.
[8] Alina Beygelzimer, John Langford, Lihong Li, Lev Reyzin, and Robert Schapire. Contextual bandit algorithms with supervised learning guarantees. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pages 19–26. JMLR Workshop and Conference Proceedings, 2011.
[9] Nataly Brukhim, Xinyi Chen, Elad Hazan, and Shay Moran. Online agnostic boosting via regret minimization. In Advances in Neural Information Processing Systems, 2020.
[10] Nataly Brukhim and Elad Hazan. Online boosting with bandit feedback. In Algorithmic Learning Theory, pages 397–420. PMLR, 2021.
[11] Lin Chen, Christopher Harshaw, Hamed Hassani, and Amin Karbasi. Projection-free online optimization with stochastic gradient: From convexity to submodularity. In International Conference on Machine Learning, pages 814–823, 2018.
[12] Shang-Tse Chen, Hsuan-Tien Lin, and Chi-Jen Lu. An online boosting algorithm with theoretical justifications. In Proceedings of the 29th International Conference on Machine Learning, pages 1873–1880, 2012.
[13] Shang-Tse Chen, Hsuan-Tien Lin, and Chi-Jen Lu. Boosting with online binary learners for the multiclass bandit problem. In International Conference on Machine Learning, pages 342–350, 2014.
[14] John Duchi, Shai Shalev-Shwartz, Yoram Singer, and Tushar Chandra. Efficient projections onto the l1-ball for learning in high dimensions. In Proceedings of the 25th International Conference on Machine Learning, pages 272–279, 2008.
[15] Marguerite Frank and Philip Wolfe. An algorithm for quadratic programming. Naval Research Logistics Quarterly, 3(1-2):95–110, 1956.
[16] Hamed Hassani, Mahdi Soltanolkotabi, and Amin Karbasi. Gradient methods for submodular maximization. In Advances in Neural Information Processing Systems, pages 5841–5851, 2017.
[17] Elad Hazan. Introduction to online convex optimization. arXiv preprint arXiv:1909.05207, 2019.
[18] Elad Hazan, Sham Kakade, Karan Singh, and Abby Van Soest. Provably efficient maximum entropy exploration. In International Conference on Machine Learning, pages 2681–2691. PMLR, 2019.
[19] Elad Hazan and Karan Singh. Boosting for online convex optimization. arXiv preprint arXiv:2102.09305, 2021.
[20] Martin Jaggi. Revisiting Frank-Wolfe: Projection-free sparse convex optimization. In International Conference on Machine Learning, pages 427–435. PMLR, 2013.
[21] Chi Jin, Zhuoran Yang, Zhaoran Wang, and Michael I. Jordan. Provably efficient reinforcement learning with linear function approximation. In Conference on Learning Theory, pages 2137–2143. PMLR, 2020.
[22] Young Hun Jung, Jack Goetz, and Ambuj Tewari. Online multiclass boosting. In Advances in Neural Information Processing Systems, pages 919–928, 2017.
[23] Young Hun Jung and Ambuj Tewari. Online boosting algorithms for multi-label ranking. In International Conference on Artificial Intelligence and Statistics, pages 279–287, 2018.
[24] Sham Kakade and John Langford. Approximately optimal approximate reinforcement learning. In Proceedings of the 19th International Conference on Machine Learning, 2002.
[25] Varun Kanade and Adam Kalai. Potential-based agnostic boosting. In Advances in Neural Information Processing Systems, pages 880–888, 2009.
[26] Alessandro Lazaric and Rémi Munos. Hybrid stochastic-adversarial on-line learning. In Conference on Learning Theory, 2009.
[27] Christian Leistner, Amir Saffari, Peter M. Roth, and Horst Bischof. On robustness of on-line boosting - a competitive study. In IEEE 12th International Conference on Computer Vision Workshops (ICCV Workshops), pages 1362–1369. IEEE, 2009.
[28] Aryan Mokhtari, Hamed Hassani, and Amin Karbasi. Stochastic conditional gradient methods: From convex minimization to submodular maximization. arXiv preprint arXiv:1804.09554, 2018.
[29] Rémi Munos and Csaba Szepesvári. Finite-time bounds for fitted value iteration. Journal of Machine Learning Research, 9(5), 2008.
[30] Fabian Pedregosa, Gaël Varoquaux, Alexandre Gramfort, Vincent Michel, Bertrand Thirion, Olivier Grisel, Mathieu Blondel, Peter Prettenhofer, Ron Weiss, Vincent Dubourg, et al. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830, 2011.
[31] Alexander Rakhlin, Karthik Sridharan, and Ambuj Tewari. Online learning: Stochastic and constrained adversaries. arXiv preprint arXiv:1104.5070, 2011.
[32] Robert E. Schapire and Yoav Freund. Boosting: Foundations and Algorithms. MIT Press, 2012.
[33] Bruno Scherrer and Matthieu Geist. Local policy search in a convex space and conservative policy iteration as boosted policy search. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pages 35–50. Springer, 2014.
[34] Wen Sun, Nan Jiang, Akshay Krishnamurthy, Alekh Agarwal, and John Langford. Model-based RL in contextual decision processes: PAC bounds and exponential improvements over model-free approaches. In Conference on Learning Theory, pages 2898–2933. PMLR, 2019.
[35] Richard S. Sutton, David McAllester, Satinder Singh, and Yishay Mansour. Policy gradient methods for reinforcement learning with function approximation. In S. Solla, T. Leen, and K. Müller, editors, Advances in Neural Information Processing Systems, volume 12. MIT Press, 2000.
[36] Yuanhao Wang, Ruosong Wang, and Sham M. Kakade. An exponential lower bound for linearly-realizable MDPs with constant suboptimality gap. arXiv preprint arXiv:2103.12690, 2021.
[37] Gellert Weisz, Philip Amortila, Barnabás Janzer, Yasin Abbasi-Yadkori, Nan Jiang, and Csaba Szepesvári. On query-efficient planning in MDPs under linear realizability of the optimal state-value function. arXiv preprint arXiv:2102.02049, 2021.
[38] Ronald J. Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8(3-4):229–256, 1992.
[39] Jiahao Xie, Zebang Shen, Chao Zhang, Hui Qian, and Boyu Wang. Stochastic recursive gradient-based methods for projection-free online learning. arXiv preprint arXiv:1910.09396, 2019.