# Representation-Driven Reinforcement Learning

Ofir Nabati¹, Guy Tennenholtz², Shie Mannor¹³

¹Department of Electrical Engineering, Technion - Israel Institute of Technology, Israel. ²Technion (currently at Google Research). ³Nvidia Research. Correspondence to: Ofir Nabati.

Proceedings of the 40th International Conference on Machine Learning, Honolulu, Hawaii, USA. PMLR 202, 2023. Copyright 2023 by the author(s).

We present a representation-driven framework for reinforcement learning. By representing policies as estimates of their expected values, we leverage techniques from contextual bandits to guide exploration and exploitation. In particular, embedding a policy network into a linear feature space allows us to reframe the exploration-exploitation problem as a representation-exploitation problem, where good policy representations enable optimal exploration. We demonstrate the effectiveness of this framework through its application to evolutionary and policy gradient-based approaches, leading to significantly improved performance compared to traditional methods. Our framework provides a new perspective on reinforcement learning, highlighting the importance of policy representation in determining optimal exploration-exploitation strategies.

## 1. Introduction

Reinforcement learning (RL) is a field of machine learning in which an agent learns to maximize a reward through interactions with an environment. The agent maps its current state into an action and receives a reward signal. Its goal is to maximize the cumulative sum of rewards over some predefined (possibly infinite) horizon (Sutton & Barto, 1998). This setting fits many real-world applications such as recommendation systems (Li et al., 2010), board games (Silver et al., 2017), computer games (Mnih et al., 2015), and robotics (Polydoros & Nalpantidis, 2017).

A large amount of contemporary research in RL focuses on gradient-based policy search methods (Sutton et al., 1999; Silver et al., 2014; Schulman et al., 2015; 2017; Haarnoja et al., 2018). Nevertheless, these methods optimize the policy locally at specific states and actions. Salimans et al. (2017) have shown that such optimization methods may cause high-variance updates in long-horizon problems, while Tessler et al. (2019) have shown possible convergence to suboptimal solutions in continuous regimes. Moreover, policy search methods are commonly sample inefficient, particularly in hard exploration problems, as policy gradient methods usually converge to areas of high reward without sacrificing exploration resources to reach a distant sparse reward.

In this work, we present Representation-Driven Reinforcement Learning (RepRL), a new framework for policy-search methods which utilizes theoretically optimal exploration strategies in a learned latent space. In particular, we reduce the policy search problem to a contextual bandit problem, using a mapping from policy space to a linear feature space. Our approach leverages the learned linear space to optimally trade off exploration and exploitation using well-established algorithms from the contextual bandit literature (Abbasi-Yadkori et al., 2011; Agrawal & Goyal, 2013). By doing so, we reframe the exploration-exploitation problem as a representation-exploitation problem, for which good policy representations enable optimal exploration.
We demonstrate the effectiveness of our approach through its application to both evolutionary and policy gradient-based approaches, demonstrating significantly improved performance compared to traditional methods. Empirical experiments on MuJoCo (Todorov et al., 2012) and MinAtar (Young & Tian, 2019) show the benefits of our approach, particularly in sparse reward settings. While our framework does not necessarily make the exploration problem easier, it provides a new perspective on reinforcement learning, shifting the focus to policy representation in the search for optimal exploration-exploitation strategies.

## 2. Preliminaries

We consider the infinite-horizon discounted Markov Decision Process (MDP). An MDP is defined by the tuple $M = (S, A, r, T, \beta, \gamma)$, where $S$ is the state space, $A$ is the action space, $T : S \times A \to \Delta(S)$ is the transition kernel, $r : S \times A \to [0, 1]$ is the reward function, $\beta \in \Delta(S)$ is the initial state distribution, and $\gamma \in [0, 1)$ is the discount factor. A stationary policy $\pi : S \to \Delta(A)$ maps states into a distribution over actions. We denote by $\Pi$ the set of stationary stochastic policies, and the history of policies and trajectories up to episode $k$ by $H_k$. Finally, we denote $S = |S|$ and $A = |A|$.

The return of a policy is a random variable defined as the discounted sum of rewards
$$G(\pi) = \sum_{t=0}^{\infty} \gamma^t r(s_t, a_t), \tag{1}$$
where $s_0 \sim \beta$, $a_t \sim \pi(s_t)$, $s_{t+1} \sim T(s_t, a_t)$, and the policy's value is its mean, i.e., $v(\pi) = \mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^t r(s_t, a_t) \mid \beta, \pi, T\right]$. An optimal policy maximizes the value, i.e., $\pi^* \in \arg\max_{\pi \in \Pi} v(\pi)$. We similarly define the per-state value function $v(\pi, s) = \mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^t r(s_t, a_t) \mid s_0 = s, \pi, T\right]$, and note that $v(\pi) = \mathbb{E}_{s \sim \beta}[v(\pi, s)]$. Finally, we denote the discounted state-action frequency distribution w.r.t. $\pi$ by $\rho_\pi(s, a) = (1 - \gamma) \sum_{t=0}^{\infty} \gamma^t \Pr(s_t = s, a_t = a \mid \beta, \pi, T)$, and let $K = \{\rho_\pi : \pi \in \Pi\}$.

### 2.1. Linear Bandits

In this work, we consider the linear bandit framework as defined in Abbasi-Yadkori et al. (2011). At each time $t$, the learner is given a decision set $D_t \subseteq \mathbb{R}^d$, which can be adversarially and adaptively chosen. The learner chooses an action $x_t \in D_t$ and receives a reward $r_t$ whose mean is linear w.r.t. $x_t$, i.e., $\mathbb{E}[r_t \mid x_t] = \langle x_t, w \rangle$ for some unknown parameter vector $w \in \mathbb{R}^d$.

Figure 1. RepRL scheme, composed of four stages: representation of the policy parameters, construction of a decision set, choice of the best arm using an off-the-shelf linear bandit algorithm, and data collection with the chosen policy.

A general framework for solving the linear bandit problem is the Optimism in the Face of Uncertainty Linear bandit algorithm (OFUL, Abbasi-Yadkori et al. (2011)). There, a linear regression estimator is constructed each round as follows:
$$\hat{w}_t = V_t^{-1} b_t, \qquad V_t = V_{t-1} + x_t x_t^\top, \qquad b_t = b_{t-1} + x_t y_t, \tag{2}$$
where $y_t, x_t$ are the noisy reward signal and chosen action at time $t$, respectively, and $V_0 = \lambda I$ for some positive parameter $\lambda > 0$. It can be shown that, under mild assumptions and with high probability, the self-normalized norm $\|\hat{w}_t - w\|_{V_t}$ can be bounded from above (Abbasi-Yadkori et al., 2011). OFUL then proceeds by taking an optimistic action $(x_t, \tilde{w}_t) \in \arg\max_{x \in D_t,\, w \in C_t} \langle x, w \rangle$, where $C_t$ is a confidence set induced by the aforementioned bound on $\|\hat{w}_t - w\|_{V_t}$. In practice, a softer version is used in Chu et al. (2011), where an action is selected optimistically according to
$$x_t \in \arg\max_{x \in D_t} \langle x, \hat{w}_t \rangle + \alpha \sqrt{x^\top V_t^{-1} x}, \tag{OFUL}$$
where $\alpha > 0$ controls the level of optimism.
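To make the estimator in Equation (2) and the optimistic selection rule concrete, the following is a minimal NumPy sketch of a ridge-regression linear bandit with OFUL-style (UCB) action selection. It is a simplified illustration under our own naming, not the authors' implementation.

```python
import numpy as np

class LinearBandit:
    """Ridge-regression linear bandit with UCB (OFUL-style) selection."""

    def __init__(self, dim, lam=0.1, alpha=1.0):
        self.V = lam * np.eye(dim)      # V_0 = lambda * I
        self.b = np.zeros(dim)          # b_0 = 0
        self.alpha = alpha              # level of optimism

    @property
    def w_hat(self):
        # w_hat_t = V_t^{-1} b_t
        return np.linalg.solve(self.V, self.b)

    def update(self, x, y):
        # Rank-one update of Equation (2): V_t = V_{t-1} + x x^T, b_t = b_{t-1} + x y
        self.V += np.outer(x, x)
        self.b += x * y

    def select_ucb(self, decision_set):
        # x_t in argmax_x <x, w_hat> + alpha * sqrt(x^T V^{-1} x)
        V_inv = np.linalg.inv(self.V)
        scores = decision_set @ self.w_hat + self.alpha * np.sqrt(
            np.einsum("ij,jk,ik->i", decision_set, V_inv, decision_set))
        return int(np.argmax(scores))
```

Here `decision_set` is an `N x d` array whose rows are the candidate feature vectors of the current decision set.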
Alternatively, linear Thompson sampling (TS, Abeille & Lazaric (2017)) shows it is possible to converge to an optimal solution with sublinear regret, even with a constant probability of optimism. This is achieved by sampling a parameter vector from a normal distribution determined by the confidence set $C_t$. Specifically, linear TS selects an action according to
$$x_t \in \arg\max_{x \in D_t} \langle x, \tilde{w}_t \rangle, \qquad \tilde{w}_t \sim \mathcal{N}\big(\hat{w}_t, \sigma^2 V_t^{-1}\big), \tag{TS}$$
where $\sigma > 0$ controls the level of optimism. We note that for tight regret guarantees, both $\alpha$ and $\sigma$ need to be chosen to respect the confidence set $C_t$. Nevertheless, it has been shown that tuning these parameters can improve performance in real-world applications (Chu et al., 2011).

## 3. RL as a Linear Bandit Problem

Classical methods for solving the RL problem attempted to use bandit formulations (Fox & Rolph, 1973). There, the set of policies $\Pi$ reflects the set of arms, and the value $v(\pi)$ is the expected bandit reward. Unfortunately, such a solution is usually intractable due to the exponential number of policies (i.e., bandit actions) in $\Pi$.

Alternatively, we consider a linear bandit formulation of the RL problem. Indeed, it is known that the value can be expressed in linear form as
$$v(\pi) = \mathbb{E}_{(s,a) \sim \rho_\pi}[r(s, a)] = \langle \rho_\pi, r \rangle. \tag{3}$$
Here, any $\rho_\pi \in K$ represents a possible action in the linear bandit formulation (Abbasi-Yadkori et al., 2011). Notice that $|K| = |\Pi|$, as any policy $\pi \in \Pi$ can be written as $\pi(a|s) = \frac{\rho_\pi(s,a)}{\sum_{a'} \rho_\pi(s,a')}$, rendering the problem intractable. Nevertheless, this formulation can be relaxed using a lower-dimensional embedding of $\rho_\pi$ and $r$. As such, we make the following assumption.

**Assumption 3.1 (Linear Embedding).** There exists a mapping $f : \Pi \to \mathbb{R}^d$ such that $v(\pi) = \langle f(\pi), w \rangle$ for all $\pi \in \Pi$ and some unknown $w \in \mathbb{R}^d$.

We note that Assumption 3.1 readily holds when $d = SA$ for $f(\pi) \equiv \rho_\pi$ and $w = r$. For efficient solutions, we consider environments for which the dimension $d$ is relatively low, i.e., $d \ll SA$. Note that neural bandit approaches also consider linear representations (Riquelme et al., 2018). Nevertheless, these methods use mappings from states, $S \mapsto \mathbb{R}^d$, whereas we consider mapping entire policies, $\Pi \mapsto \mathbb{R}^d$ (i.e., embedding the function $\pi$).

Learning a mapping $f$ can be viewed as trading the effort of finding good exploration strategies in deep RL problems for that of finding a good representation. We emphasize that we do not claim it to be an easier task, but rather a different viewpoint of the problem, from which new solutions can be derived. Similar to work on neural bandits (Riquelme et al., 2018), finding such a mapping requires alternating between representation learning and exploration.
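As an illustration of this reduction, the sketch below treats a finite set of candidate policies as bandit arms: each policy is embedded by some mapping `embed` (standing in for $f$ in Assumption 3.1), and linear Thompson sampling picks the next policy to roll out. The function name, the `embed` argument, and the reuse of the `LinearBandit` class from the previous sketch are our own illustrative assumptions.

```python
import numpy as np

def select_policy_ts(bandit, candidate_policies, embed, sigma=1.0, rng=None):
    """Pick the next policy to roll out via linear Thompson sampling.

    bandit: a LinearBandit as sketched above (holds V_t and b_t).
    candidate_policies: list of policy parameter vectors (the decision set).
    embed: callable mapping policy parameters -> feature vector in R^d.
    """
    rng = rng or np.random.default_rng()
    features = np.stack([embed(theta) for theta in candidate_policies])  # N x d

    # Sample w_tilde ~ N(w_hat, sigma^2 V^{-1}) and act greedily w.r.t. it.
    V_inv = np.linalg.inv(bandit.V)
    w_tilde = rng.multivariate_normal(bandit.w_hat, sigma**2 * V_inv)
    best = int(np.argmax(features @ w_tilde))
    return candidate_policies[best], features[best]
```

After rolling out the selected policy and observing its return `G`, calling `bandit.update(features[best], G)` closes the loop.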
### 3.1. RepRL

We formalize a representation-driven framework for RL, inspired by linear bandits (Section 2.1) and Assumption 3.1. We parameterize the policy $\pi$ and mapping $f$ using neural networks, $\pi_\theta$ and $f_\phi$, respectively. Here, a policy $\pi_\theta$ is represented in a lower-dimensional space as $f_\phi(\pi_\theta)$; therefore, searching in policy space is equivalent to searching in the parameter space. With a slight abuse of notation, we denote $f_\phi(\pi_\theta) = f_\phi(\theta)$.

Pseudo code for RepRL is presented in Algorithm 1. At every episode $k$, we map the policy's parameters $\theta_{k-1}$ to a latent space using $f_{\phi_{k-1}}(\theta_{k-1})$. We then use a construction algorithm, ConstructDecisionSet($\theta_{k-1}, H_{k-1}$), which takes into account the history $H_{k-1}$, to generate a new decision set $D_k$. Then, to update the parameters $\theta_{k-1}$ of the policy, we select an optimistic policy $\pi_{\theta_k} \in D_k$ using a linear bandit method such as TS or OFUL (see Section 2.1). Finally, we roll out the policy $\pi_{\theta_k}$ and update the representation network and the bandit parameters according to the procedure outlined in Equation (2), where $x_k$ are the learned representations of $f_{\phi_k}$. A visual schematic of our framework is depicted in Figure 1. In the following sections, we present and discuss methods for representation learning and decision set construction, and propose two implementations of RepRL in the context of evolutionary strategies and policy gradient.

**Algorithm 1: RepRL**
1. Init: $H_0 \leftarrow \emptyset$; $\pi_{\theta_0}$, $f_{\phi_0}$ randomly initialized.
2. for $k = 1, 2, \ldots$ do
3. Representation stage: map the policy network $\pi_{\theta_{k-1}}$ using the representation network $f_{\phi_{k-1}}(\theta_{k-1})$.
4. Decision set stage: $D_k \leftarrow$ ConstructDecisionSet($\theta_{k-1}, H_{k-1}$).
5. Bandit stage: use a linear bandit algorithm to choose $\pi_{\theta_k}$ out of $D_k$.
6. Exploitation stage: roll out policy $\pi_{\theta_k}$ and store the return $G_k$ in $H_k$.
7. Update the representation $f_{\phi_k}$.
8. Update the bandit parameters $\hat{w}_t, V_t$ (Equation (2)) with the updated representation.
9. end for

We note that RepRL is a framework for addressing RL through representation, and as such, any representation learning technique or decision set algorithm can be incorporated as long as the basic structure is maintained.

### 3.2. Learning Representations for RepRL

We learn a linear representation of a policy using tools from variational inference. Specifically, we sample a representation from a posterior distribution $z \sim f_\phi(z|\theta)$, and train the representation by maximizing the Evidence Lower Bound (ELBO) (Kingma & Welling, 2013):
$$\mathcal{L}(\phi, \kappa) = \mathbb{E}_{z \sim f_\phi(z|\theta)}\big[\log p_\kappa(G|z)\big] - D_{\mathrm{KL}}\big(f_\phi(z|\theta) \,\|\, p(z)\big),$$
where $f_\phi(z|\theta)$ acts as the encoder of the embedding, and $p_\kappa(G|z)$ is the return decoder, or likelihood term. The latent representation prior $p(z)$ is typically chosen to be a zero-mean Gaussian distribution. To encourage linearity of the value (i.e., the return's mean) with respect to the learned representation, we choose the likelihood to be a Gaussian distribution whose mean is linear in the representation, i.e., $p_\kappa(G|z) = \mathcal{N}(\kappa^\top z, \sigma^2)$. When the encoder is also chosen to be Gaussian, the loss function has a closed form. The choice of a linear decoder is crucial, since the value is assumed to be linear w.r.t. the learned embeddings. The parameters $\phi$ and $\kappa$ are the learned parameters of the encoder and decoder, respectively. Note that a deterministic mapping is recovered when $f_\phi(z|\theta)$ takes the form of a Dirac delta function. A schematic of the architectural framework is presented in Figure 2.

Figure 2. The structure of the networks in RepRL. The policy's parameters are fed into the representation network, which acts as a posterior distribution over the policy's latent representation. Sampling from this posterior, the latent representation is used by the bandit algorithm to evaluate the value, which encapsulates the exploration-exploitation tradeoff.
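The following PyTorch sketch shows one way to realize this objective, assuming the policy parameters are simply flattened into a vector and fed to an MLP encoder (the paper's stronger DWS encoder is described in Appendix F); all module and variable names here are illustrative, not the authors' code.

```python
import torch
import torch.nn as nn

class PolicyEncoder(nn.Module):
    """Gaussian encoder f_phi(z | theta) over flattened policy parameters,
    with a linear Gaussian return decoder p_kappa(G | z) = N(kappa^T z, sigma^2)."""

    def __init__(self, param_dim, latent_dim, hidden=256, sigma=1.0):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(param_dim, hidden), nn.ReLU(),
                                  nn.Linear(hidden, hidden), nn.ReLU())
        self.mu_head = nn.Linear(hidden, latent_dim)
        self.logvar_head = nn.Linear(hidden, latent_dim)
        self.kappa = nn.Linear(latent_dim, 1, bias=False)  # linear return decoder
        self.sigma = sigma

    def forward(self, theta):
        h = self.body(theta)
        return self.mu_head(h), self.logvar_head(h)

    def elbo(self, theta, returns):
        mu, logvar = self(theta)
        std = torch.exp(0.5 * logvar)
        z = mu + std * torch.randn_like(std)            # reparameterization trick
        # Gaussian log-likelihood of the return under N(kappa^T z, sigma^2),
        # up to an additive constant.
        pred = self.kappa(z).squeeze(-1)
        log_lik = -0.5 * ((returns - pred) ** 2) / self.sigma**2
        # KL( N(mu, diag(var)) || N(0, I) ), closed form.
        kl = 0.5 * torch.sum(mu**2 + logvar.exp() - 1.0 - logvar, dim=-1)
        return (log_lik - kl).mean()                    # maximize this
```

Training maximizes `elbo` (e.g., by minimizing its negation with Adam); the posterior mean `mu` can then serve as the policy's feature vector $x_k$ in the bandit updates.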
### 3.3. Constructing a Decision Set

The choice of the decision set algorithm (line 4 of Algorithm 1) may have a great impact on the algorithm in terms of performance and computational complexity. Clearly, choosing $D_k = \Pi$ for all $k$ is infeasible in terms of computational complexity. Moreover, it may be impractical to learn a linear representation for all policies at once. We present several possible choices of decision sets below.

**Policy Space Decision Set.** One potential strategy is to sample a set of policies centered around the current policy:
$$D_k = \{\theta_k + \epsilon_i\}_{i=1}^{N}, \qquad \epsilon_i \sim \mathcal{N}(0, \nu^2 I), \tag{4}$$
where $\nu > 0$ controls how local the policy search is. This approach is motivated by the assumption that the representation of policies in the vicinity of the current policy will exhibit linear behavior with respect to the value function, due to their similarity to policies encountered by the learner thus far.

**Latent Space Decision Set.** An alternative approach samples policies in their learned latent space, i.e.,
$$D_k = \{z_k + \epsilon_i\}_{i=1}^{N}, \qquad \epsilon_i \sim \mathcal{N}(0, \nu^2 I), \tag{5}$$
where $z_k \sim f_\phi(z|\theta_k)$. The linearity of the latent space ensures that this decision set will improve the linear bandit target (the UCB, or the sampled value in TS), which will subsequently lead to an improvement in the actual value. This approach enables optimal exploration w.r.t. linear bandits, as it uniformly samples the eigendirections of the precision matrix $V_t$, rather than only sampling specific directions, as may occur when sampling in the parameter space.

Unlike Equation (4), constructing the set in Equation (5) presents several challenges. First, in order to roll out the policy $\pi_{\theta_k}$, one must construct an inverse mapping to extract the chosen policy from the selected latent representation. This can be done by training a decoder for the policy parameters, $q(\theta|z)$. Alternatively, we propose a decoder-free approach: given a target embedding $z^* \in \arg\max_{z \in D_t} \langle z, \hat{w} \rangle$, we search for a policy $\theta^* \in \arg\max_\theta f_\phi(z^*|\theta)$. This optimization problem can be solved using gradient descent-based algorithms by varying the inputs to $f_\phi$. A second challenge for latent-based decision sets involves the realizability of such policies; that is, there may exist representations $z \in D_k$ which are not mapped to by any policy in $\Pi$. Lastly, even for realizable policies, the recovered $\theta$ may be too far from the learned data manifold, leading to an overestimation of its value and a degradation of the overall optimization process. One way to address these issues is to use a small enough value of $\nu$ during sampling, reducing the probability of set members falling outside the data distribution. We leave more sophisticated methods of latent-based decision sets for future work.

**History-based Decision Set.** An additional approach uses the history of policies up to episode $k$ to design a decision set. Specifically, at episode $k$ we sample around the set of policies observed so far, i.e.,
$$D_k = \bigcup_{\ell \in [k]} \{\theta_\ell + \epsilon_{\ell,i}\}_{i=1}^{N}, \qquad \epsilon_{\ell,i} \sim \mathcal{N}(0, \nu^2 I), \tag{6}$$
resulting in a decision set of size $Nk$. As the representation improves over time, it may become possible to find a better policy near policies that have already been used and were missed due to poor representation or sampling mismatch. This method is quite general, as the history can be truncated to consider only a certain number of past time steps rather than the complete set of policies observed so far. Truncating the history can help reduce the size of the decision set, making the search more computationally tractable.

In Section 5, we compare the various choices of decision sets. Nevertheless, we found that policy space decision sets are a good first choice due to their simplicity, which leads to stable implementations. Further exploration of other decision sets is left as a topic for future research.
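A minimal sketch of the policy-space and history-based constructions in Equations (4) and (6), assuming flattened policy parameter vectors; the function names are ours.

```python
import numpy as np

def policy_space_decision_set(theta, n_candidates, nu, rng):
    """Equation (4): Gaussian perturbations around the current policy parameters."""
    noise = rng.normal(0.0, nu, size=(n_candidates, theta.shape[0]))
    return theta[None, :] + noise                       # shape: (N, param_dim)

def history_decision_set(theta_history, n_per_policy, nu, rng, window=None):
    """Equation (6): perturbations around (a truncated window of) past policies."""
    past = theta_history[-window:] if window else theta_history
    sets = [policy_space_decision_set(th, n_per_policy, nu, rng) for th in past]
    return np.concatenate(sets, axis=0)                 # shape: (N * |window|, param_dim)
```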
### 3.4. Inner Trajectory Sampling

Vanilla RepRL uses the return values of the entire trajectory. As a result, sampling trajectories at their initial states is the natural choice for both the bandit update and representation learning. However, the discount factor diminishes learning signals beyond the $\frac{1}{1-\gamma}$ effective horizon, preventing the algorithm from utilizing these signals, which may be critical in environments with long-term dependencies. On the other hand, using a discount factor of $\gamma = 1$ would result in returns with large variance, leading to poor learning.

Instead of sampling from the initial state, we propose to use the discount factor and sample trajectories at various states during learning, enabling the learner to observe data from different locations along the trajectory. Under this sampling scheme, the estimated value is an estimate of the quantity $\tilde{v}(\pi) = \mathbb{E}_{s \sim \rho_\pi}[v(\pi, s)]$. In the following proposition, we prove that optimizing $\tilde{v}(\pi)$ is equivalent to optimizing the real value.

**Proposition 3.2.** For any policy $\pi \in \Pi$, $\tilde{v}(\pi) \propto v(\pi)$.

The proof can be found in Appendix C. That is, sampling along the trajectory from $\rho_\pi$ approximates a scaled version of the value which, like $v(\pi)$, exhibits linear behavior with respect to the reward function. Thus, instead of sampling the return defined in Equation (1), we sample $\tilde{G}(\pi) = \sum_{t=0}^{\infty} \gamma^t r(s_t, a_t)$, where $s_0 \sim \rho_\pi$, $a_t \sim \pi(s_t)$, $s_{t+1} \sim T(s_t, a_t)$, both during representation learning and bandit updates. Empirical evidence suggests that uniformly sampling from the stored trajectory produces satisfactory results in practice.
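A minimal sketch of this sampling scheme: given a stored trajectory of rewards, pick a start index uniformly and compute the discounted return from there. The uniform start-index choice follows the uniform sampling described above; the function name is ours.

```python
import numpy as np

def inner_trajectory_return(rewards, gamma, rng):
    """Sample a start state uniformly along a stored trajectory and return
    the discounted return from that point onward (the target G-tilde)."""
    rewards = np.asarray(rewards, dtype=float)
    start = rng.integers(0, len(rewards))               # uniform start index
    discounts = gamma ** np.arange(len(rewards) - start)
    return float(np.dot(discounts, rewards[start:]))
```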
## 4. RepRL Algorithms

In this section we describe two possible approaches for applying the RepRL framework, namely within Evolution Strategies (Wierstra et al., 2014) and Policy Gradients (Sutton et al., 1999).

### 4.1. Representation-Driven Evolution Strategy

Evolution Strategies (ES) train agents by searching through the parameter space of their policy and sampling their return. In contrast to traditional gradient-based methods, ES uses a population of candidates evolving over time through genetic operators to find the optimal parameters for the agent. Such methods have been shown to be effective in training deep RL agents in high-dimensional environments (Salimans et al., 2017; Mania et al., 2018). At each round, the decision set is chosen over the policy space with Gaussian sampling around the current policy, as described in Section 3.3.

**Algorithm 2: Representation-Driven Evolution Strategy (outline)**
1. Input: initial policy $\pi_0 = \pi_{\theta_0}$, noise $\nu$, step size $\alpha$, decision set size $N$, history $H$.
2. for $t = 1, 2, \ldots, T$ do
3. Sample an evaluation set and collect its returns.
4. Update the representation $f_t$ and bandit parameters $(\hat{w}_t, V_t)$ using the history.
5. Construct a decision set $D_t$.
6. Use the linear bandit algorithm to evaluate each policy in $D_t$.
7. Update the policy using the ES scheme (Section 4.1).
8. end for

Algorithm 5 (Appendix A) gives the full ES implementation of RepRL. To improve the stability of the optimization process, we employ soft-weighted updates across the decision set. This type of update rule is similar to that used in ES algorithms (Salimans et al., 2017; Mania et al., 2018), and allows for an optimal exploration-exploitation tradeoff, replacing the true sampled returns with the bandit's value. Moreover, instead of only sampling the chosen policy, we evaluate it by also sampling around it, as done in ES-based algorithms; each evaluation is used for the bandit parameter update and the representation learning process. Sampling the evaluated policies around the chosen policy helps the representation avoid overfitting to a specific policy and generalize better to unseen policies, an important property when selecting the next policy.

Unlike traditional ES, optimizing the UCB in the case of OFUL, or sampling using TS, can encourage the algorithm to explore unseen policies in the parameter space. This exploration is further stabilized by averaging over the sampled directions rather than assigning the best policy in the decision set, which is particularly useful when the representation is still noisy, reducing the risk of instability caused by hard assignments. An alternative approach uses a subset of $D_t$ with the highest bandit scores, as suggested in Mania et al. (2018), which biases the numerical gradient towards the directions with the highest potential return.

### 4.2. Representation-Driven Policy Gradient

RepRL can also be utilized as a regularizer for policy gradient algorithms. Pseudo code for using RepRL in policy gradients is shown in Algorithm 6 (Appendix A); a short outline is given in Algorithm 3. At each gradient step, a weighted regularization term $d(\theta, \tilde{\theta})$ is added, where $\tilde{\theta}$ are the parameters output by RepRL with respect to the current parameters, for a chosen metric (e.g., $\ell_2$):
$$\mathcal{L}_{\mathrm{reg}}(\theta) = \mathcal{L}_{\mathrm{PG}}(\theta) + \zeta\, d(\theta, \tilde{\theta}). \tag{7}$$

**Algorithm 3: Representation-Driven Policy Gradient (outline)**
1. Input: initial policy $\pi_\theta$, decision set size $N$, history $H$.
2. for $t = 1, 2, \ldots, T$ do
3. Collect trajectories using $\pi_\theta$.
4. Update the representation $f$ and bandit parameters $(\hat{w}_t, V_t)$ using the history.
5. Compute the policy gradient loss $\mathcal{L}_{\mathrm{PG}}(\theta)$.
6. Sample a decision set and choose the best policy $\tilde{\theta}$.
7. Compute the gradient of the regularized policy gradient loss with $d(\theta, \tilde{\theta})$ (Equation (7)).
8. end for

After collecting data with the chosen policy and updating the representation and bandit parameters, the regularization term is added to the policy gradient loss at each gradient step. The policy gradient algorithm can be either on-policy or off-policy; in this work we experiment with an on-policy algorithm. Similar to the soft update rule in ES, using RepRL as a regularizer can significantly stabilize the representation process. Applying the regularization term biases the policy toward an optimal exploration strategy in policy space. This can be particularly useful when the representation is still weak and the optimization process is unstable, as it helps guide the update toward more promising areas of the parameter space. In our experiments, we found that using RepRL as a regularizer for policy gradients improved the stability and convergence of the optimization process.
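A sketch of the regularized objective in Equation (7) with an $\ell_2$ metric, assuming `pg_loss` is the loss of whatever policy gradient algorithm is being regularized (PPO in the paper's experiments) and `theta_tilde` is the policy chosen by the bandit stage; the function and argument names are illustrative.

```python
import torch

def reprl_regularized_loss(pg_loss, policy_params, theta_tilde, zeta=1.0):
    """Equation (7): L_reg = L_PG + zeta * ||theta - theta_tilde||^2.

    pg_loss: scalar tensor, the underlying policy gradient loss.
    policy_params: iterable of the policy's parameter tensors (requires_grad=True).
    theta_tilde: flat tensor of the target parameters chosen by the bandit stage.
    """
    theta = torch.cat([p.reshape(-1) for p in policy_params])
    reg = torch.sum((theta - theta_tilde.detach()) ** 2)
    return pg_loss + zeta * reg
```

Minimizing this loss pulls the policy toward the bandit's optimistic choice while still following the policy gradient signal.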
## 5. Experiments

To evaluate the performance of RepRL, we conducted experiments on various tasks in the MuJoCo (Todorov et al., 2012) and MinAtar (Young & Tian, 2019) domains. We also used a sparse version of the MuJoCo environments, where exploration is crucial. We used linear TS as our linear bandit algorithm, as it exhibited good performance during evaluation. The detailed network architectures and hyperparameters used in the experiments are provided in Appendix F.

**Grid-World Visualization.** Before presenting our results, we demonstrate the RepRL framework on a toy example. Specifically, we constructed a Grid World environment (depicted in Figure 4) which consists of spatially changing, noisy rewards. The agent, initialized at the bottom-left state $(x, y) = (1, 1)$, can choose one of four actions: up, down, left, or right. To focus on exploration, the rewards were distributed unevenly across the grid. Specifically, the reward at every $(x, y)$ was given by the Normal random variable $r(x, y) \sim \mathcal{N}(\mu(x, y), \sigma^2)$, where $\sigma > 0$ and
$$\mu(x, y) = R_1 \exp\left\{-\tfrac{(x - x_1)^2 + (y - y_1)^2}{a_1}\right\} + R_2 \exp\left\{-\tfrac{(x - x_2)^2 + (y - y_2)^2}{a_2}\right\} + R_3 \mathbb{1}\{(x, y) = \text{goal}\}.$$
That is, the reward consisted of Normally distributed noise, with mean defined by two spatial Gaussians, as shown in Figure 4, with $R_1 > R_2$, $a_1 < a_2$, and a goal state (depicted as a star) with $R_3 \gg R_1, R_2$. Importantly, the values of $R_1, R_2, R_3, a_1, a_2$ were chosen such that an optimal policy takes the upper route in Figure 4.

Figure 4. Grid World visualization experiment. Trajectories were averaged across 100 seeds at various times during training, where more recent trajectories have greater opacity. Background colors indicate the level of mean reward.

Comparing the behavior of RepRL and ES on the Grid World environment, we found that RepRL explored the environment more efficiently, locating the optimal path to the goal. This emphasizes the differing characteristics of state-space-driven exploration vs. policy-space-driven exploration, which, in our framework, coincides with representation-driven exploration. Figure 3 illustrates a two-dimensional t-SNE plot comparing the learned latent representation of the policy with the direct representation of the policy weights.

Figure 3. Two-dimensional t-SNE visualization of the policy representation in the Grid World experiment. On the right, the learned latent representation; on the left, the direct representation of the policy's weights. Each point corresponds to a distinct policy, and its color corresponds to a sample of the policy's value.

**Decision Set Comparison.** We begin by evaluating the impact of the decision set on the performance of RepRL. For this, we tested the three decision sets outlined in Section 3.3. The evaluation was conducted using the Representation-Driven Evolution Strategy variant on a sparse HalfCheetah environment. A history window of 20 policies was used when evaluating the history-based decision set, and a gradient descent algorithm was employed to obtain the parameters corresponding to the selected latent code in the latent-based setting. As depicted in Figure 8 in Appendix E, RepRL demonstrated similar performance for the varying decision sets on the tested domains. In what follows, we focus on policy space decision sets.

**MuJoCo.** We conducted experiments on the MuJoCo suite using RepRL. Our approach followed the setting of Mania et al. (2018), in which a linear policy was used and demonstrated excellent performance on MuJoCo tasks. We utilized the ES variant of our algorithm (Algorithm 5). We incorporated a weighted update between the gradient using the bandit value and the zero-order gradient of the sampled returns, taking advantage of the sampled information and ensuring stable updates in areas where the representation is weak. We first evaluated RepES on the standard MuJoCo baselines (see Figure 5). RepES either significantly outperformed or performed on par with ES.

Figure 5. MuJoCo experiments during training: the standard MuJoCo suite (top) and the modified sparse MuJoCo (bottom).

We also tested a modified, sparse variant of MuJoCo.
In the sparse environment, a reward was given for reaching a goal at each distance interval, denoted $d$, with the reward function defined as
$$r(s, a) = \begin{cases} 10 - c(a), & |x_{\text{agent}}| \bmod d = 0, \\ -c(a), & \text{otherwise,} \end{cases}$$
where $c(a)$ is the control cost associated with taking action $a$, and $x_{\text{agent}}$ denotes the location of the agent along the x-axis. The presence of a control cost incentivizes the agent to maintain its position rather than actively explore the environment. The results of this experiment, depicted in Figure 5, indicate that RepRL outperformed both the ES and SAC algorithms in terms of achieving distant goals. It should be noted, however, that the random search component of the ES algorithm occasionally resulted in successful goal attainment, albeit at a significantly lower rate than the RepRL algorithm.
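A sketch of this sparse reward as a simple function. Since the x-position is continuous, the milestone check below triggers when the agent crosses a new multiple of $d$; this crossing-based trigger is our own reading of the modulo condition above, and the function name is illustrative.

```python
def sparse_mujoco_reward(x_agent, prev_milestone, d, control_cost):
    """Sparse reward: +10 (minus control cost) when the agent passes a new
    multiple of d along the x-axis, otherwise only the control cost.

    Returns the reward and the updated milestone count.
    """
    milestone = int(abs(x_agent) // d)
    if milestone > prev_milestone:
        return 10.0 - control_cost, milestone
    return -control_cost, prev_milestone
```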
**MinAtar.** We compared the performance of RepRL on MinAtar (Young & Tian, 2019) with the widely used policy gradient algorithm PPO (Schulman et al., 2017). Specifically, we compared PPO against its RepRL-regularized version, as described in Algorithm 6, which we refer to as RepPG. We parameterized the policy by a neural network. Although PPO collects chunks of rollouts (i.e., uses subtrajectories), RepPG adjusts naturally due to the inner trajectory sampling (see Section 3.4); the critic was used to estimate the value of the remainder of the trajectory in cases where the rollouts were truncated by the algorithm. Results are shown in Figure 6. Overall, RepRL outperforms PPO on all tasks, suggesting that RepRL is effective at solving challenging tasks with sparse rewards, such as those found in MinAtar.

Figure 6. MinAtar experiments during training.

## 6. Related Work

**Policy Optimization.** Policy gradient methods (Sutton et al., 1999) have shown great success at various challenging tasks, with numerous improvements over the years; most notable are policy gradient methods for deterministic policies (Silver et al., 2014; Lillicrap et al., 2015), trust-region-based algorithms (Schulman et al., 2015; 2017), and maximum entropy algorithms (Haarnoja et al., 2018). Despite their popularity, traditional policy gradient methods are limited in continuous action spaces. Therefore, Tessler et al. (2019) suggest optimizing the policy over the policy distribution space rather than the action space. In recent years, finite-difference gradient methods have been rediscovered by the RL community. This class of algorithms uses numerical gradient estimation by sampling random directions (Nesterov & Spokoiny, 2017). A closely related family of optimization methods is Evolution Strategies (ES), a class of black-box optimization algorithms that perform heuristic search by perturbing and evaluating the set members, keeping only the mutations with the highest scores until convergence. Salimans et al. (2017) used ES for RL as a zero-order gradient estimator for a policy parameterized as a neural network. ES is robust to the choice of the reward function and the horizon length, and it does not require value function approximation as most state-of-the-art algorithms do. Nevertheless, it suffers from low sample efficiency due to the potentially noisy returns and the usage of the final return value as the sole learning signal. Moreover, it is not effective in hard exploration tasks. Mania et al. (2018) improve ES by using only the most promising directions for gradient estimation.

**Policy Search with Bandits.** Fox & Rolph (1973) was one of the first works to utilize multi-armed bandits for policy search over a countable stationary policy set, a core approach for follow-up work (Burnetas & Katehakis, 1997; Agrawal et al., 1988). Nevertheless, the concept was left aside due to its difficulty in scaling up to large environments. As an alternative, neural-linear bandits (Riquelme et al., 2018; Xu et al., 2020; Nabati et al., 2021) simultaneously train a neural network policy while interacting with the environment, using a chosen linear bandit method, and are closely related to the neural-bandits literature (Zhou et al., 2020; Kassraie & Krause, 2022). In contrast to this line of work, our work maps entire policy functions into a linear space, where linear bandit approaches can take effect. This induces an exploration strategy in policy space, as opposed to locally, in action space.

**Representation Learning.** Learning a compact and useful representation of states (Laskin et al., 2020; Schwartz et al., 2019; Tennenholtz & Mannor), actions (Tennenholtz & Mannor, 2019; Chandak et al., 2019), rewards (Barreto et al., 2017; Nair et al., 2018; Toro Icarte et al., 2019), and policies (Hausman et al., 2018; Eysenbach et al., 2018) has been at the core of a vast array of research. Such representations can be used to improve agents' performance by utilizing the structure of an environment more efficiently. Policy representation has been the focus of recent studies, including the work by Tang et al. (2022), which, similar to our approach, utilizes policy representation to learn a generalized value function. They demonstrate that the generalized value function can generalize across policies and improve value estimation for actor-critic algorithms, given certain conditions. In another study, Li et al. (2022) enhance the stability and efficiency of Evolutionary Reinforcement Learning (ERL) (Khadka & Tumer, 2018) by adopting a linear policy representation with a shared state representation between the evolution and RL components. In our research, we view the representation problem as an alternative lens on the exploration-exploitation problem in RL. Although this shift does not necessarily simplify the problem, it transfers the challenge to a different domain, offering opportunities for the development of new methods.

## 7. Discussion and Future Work

We presented RepRL, a novel representation-driven framework for reinforcement learning. By optimizing the policy over a learned representation, we leveraged techniques from the contextual bandit literature to guide exploration and exploitation. We demonstrated the effectiveness of this framework through its application to evolutionary and policy gradient-based approaches, leading to significantly improved performance compared to traditional methods.

In this work, we suggested reframing the exploration-exploitation problem as a representation-exploitation problem. By embedding the policy network into a linear feature space, good policy representations enable optimal exploration. This framework provides a new perspective on reinforcement learning, highlighting the importance of policy representation in determining optimal exploration-exploitation strategies. As future work, one can incorporate RepRL into more involved representation methods, including pretrained large Transformers (Devlin et al., 2018; Brown et al., 2020), which have shown great promise recently in various areas of machine learning.
Another avenue for future research is the use of RepRL in scenarios where the policy is optimized in latent space using an inverse mapping (i.e., a decoder), as well as more involved decision sets. Finally, while this work focused on linear bandit algorithms, future work may explore the use of general contextual bandit algorithms (e.g., SquareCB, Foster & Rakhlin (2020)), which are not restricted to linear representations.

## 8. Acknowledgments

This work was partially funded by the Israel Science Foundation under Contract 2199/20.

## References

Yasin Abbasi-Yadkori, David Pal, and Csaba Szepesvari. Improved algorithms for linear stochastic bandits. In Advances in Neural Information Processing Systems, pp. 2312-2320, 2011.

Marc Abeille and Alessandro Lazaric. Linear Thompson sampling revisited. In Artificial Intelligence and Statistics, pp. 176-184. PMLR, 2017.

Rajeev Agrawal, Demosthenis Teneketzis, and Venkatachalam Anantharam. Asymptotically efficient adaptive allocation schemes for controlled Markov chains: Finite parameter space. Technical report, University of Michigan, Ann Arbor, Communications and Signal Processing Lab, 1988.

Shipra Agrawal and Navin Goyal. Thompson sampling for contextual bandits with linear payoffs. In International Conference on Machine Learning, pp. 127-135, 2013.

André Barreto, Will Dabney, Rémi Munos, Jonathan J Hunt, Tom Schaul, Hado P van Hasselt, and David Silver. Successor features for transfer in reinforcement learning. Advances in Neural Information Processing Systems, 30, 2017.

Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. Advances in Neural Information Processing Systems, 33:1877-1901, 2020.

Apostolos N Burnetas and Michael N Katehakis. Optimal adaptive policies for Markov decision processes. Mathematics of Operations Research, 22(1):222-255, 1997.

Yash Chandak, Georgios Theocharous, James Kostas, Scott Jordan, and Philip Thomas. Learning action representations for reinforcement learning. In International Conference on Machine Learning, pp. 941-950. PMLR, 2019.

Wei Chu, Lihong Li, Lev Reyzin, and Robert Schapire. Contextual bandits with linear payoff functions. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 208-214. JMLR Workshop and Conference Proceedings, 2011.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018.

Benjamin Eysenbach, Abhishek Gupta, Julian Ibarz, and Sergey Levine. Diversity is all you need: Learning skills without a reward function. arXiv preprint arXiv:1802.06070, 2018.

Dylan Foster and Alexander Rakhlin. Beyond UCB: Optimal and efficient contextual bandits with regression oracles. In International Conference on Machine Learning, pp. 3199-3210. PMLR, 2020.

Bennett L Fox and John E Rolph. Adaptive policies for Markov renewal programs. The Annals of Statistics, 1(2):334-341, 1973.

Tuomas Haarnoja, Aurick Zhou, Pieter Abbeel, and Sergey Levine. Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. In International Conference on Machine Learning, pp. 1861-1870, 2018.

Karol Hausman, Jost Tobias Springenberg, Ziyu Wang, Nicolas Heess, and Martin Riedmiller. Learning an embedding space for transferable robot skills. In International Conference on Learning Representations, 2018.
Parnian Kassraie and Andreas Krause. Neural contextual bandits without regret. In International Conference on Artificial Intelligence and Statistics, pp. 240-278. PMLR, 2022.

Shauharda Khadka and Kagan Tumer. Evolution-guided policy gradient in reinforcement learning. Advances in Neural Information Processing Systems, 31, 2018.

Diederik P Kingma and Max Welling. Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114, 2013.

Michael Laskin, Aravind Srinivas, and Pieter Abbeel. CURL: Contrastive unsupervised representations for reinforcement learning. In International Conference on Machine Learning, pp. 5639-5650. PMLR, 2020.

Lihong Li, Wei Chu, John Langford, and Robert E Schapire. A contextual-bandit approach to personalized news article recommendation. In Proceedings of the 19th International Conference on World Wide Web, pp. 661-670, 2010.

Pengyi Li, Hongyao Tang, Jianye Hao, Yan Zheng, Xian Fu, and Zhaopeng Meng. ERL-Re: Efficient evolutionary reinforcement learning with shared state representation and individual policy representation. arXiv preprint arXiv:2210.17375, 2022.

Timothy P Lillicrap, Jonathan J Hunt, Alexander Pritzel, Nicolas Heess, Tom Erez, Yuval Tassa, David Silver, and Daan Wierstra. Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971, 2015.

Horia Mania, Aurelia Guy, and Benjamin Recht. Simple random search provides a competitive approach to reinforcement learning. arXiv preprint arXiv:1803.07055, 2018.

Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A Rusu, Joel Veness, Marc G Bellemare, Alex Graves, Martin Riedmiller, Andreas K Fidjeland, Georg Ostrovski, et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529-533, 2015.

Ofir Nabati, Tom Zahavy, and Shie Mannor. Online limited memory neural-linear bandits with likelihood matching. arXiv preprint arXiv:2102.03799, 2021.

Ashvin V Nair, Vitchyr Pong, Murtaza Dalal, Shikhar Bahl, Steven Lin, and Sergey Levine. Visual reinforcement learning with imagined goals. Advances in Neural Information Processing Systems, 31, 2018.

Aviv Navon, Aviv Shamsian, Idan Achituve, Ethan Fetaya, Gal Chechik, and Haggai Maron. Equivariant architectures for learning in deep weight spaces. arXiv preprint arXiv:2301.12780, 2023.

Yurii Nesterov and Vladimir Spokoiny. Random gradient-free minimization of convex functions. Foundations of Computational Mathematics, 17(2):527-566, 2017.

Athanasios S Polydoros and Lazaros Nalpantidis. Survey of model-based reinforcement learning: Applications on robotics. Journal of Intelligent & Robotic Systems, 86(2):153-173, 2017.

Carlos Riquelme, George Tucker, and Jasper Snoek. Deep Bayesian bandits showdown: An empirical comparison of Bayesian deep networks for Thompson sampling. arXiv preprint arXiv:1802.09127, 2018.

Tim Salimans, Jonathan Ho, Xi Chen, Szymon Sidor, and Ilya Sutskever. Evolution strategies as a scalable alternative to reinforcement learning. arXiv preprint arXiv:1703.03864, 2017.

John Schulman, Sergey Levine, Pieter Abbeel, Michael Jordan, and Philipp Moritz. Trust region policy optimization. In International Conference on Machine Learning, pp. 1889-1897, 2015.

John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.

Erez Schwartz, Guy Tennenholtz, Chen Tessler, and Shie Mannor. Language is power: Representing states using natural language in reinforcement learning. arXiv preprint arXiv:1910.02789, 2019.
David Silver, Guy Lever, Nicolas Heess, Thomas Degris, Daan Wierstra, and Martin Riedmiller. Deterministic policy gradient algorithms. In International Conference on Machine Learning, pp. 387-395, 2014.

David Silver, Julian Schrittwieser, Karen Simonyan, Ioannis Antonoglou, Aja Huang, Arthur Guez, Thomas Hubert, Lucas Baker, Matthew Lai, Adrian Bolton, et al. Mastering the game of Go without human knowledge. Nature, 550(7676):354-359, 2017.

Richard S Sutton and Andrew G Barto. Reinforcement learning: An introduction. MIT Press, Cambridge, 1998.

Richard S Sutton, David McAllester, Satinder Singh, and Yishay Mansour. Policy gradient methods for reinforcement learning with function approximation. Advances in Neural Information Processing Systems, 12, 1999.

Hongyao Tang, Zhaopeng Meng, Jianye Hao, Chen Chen, Daniel Graves, Dong Li, Changmin Yu, Hangyu Mao, Wulong Liu, Yaodong Yang, et al. What about inputting policy in value function: Policy representation and policy-extended value function approximator. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 36, pp. 8441-8449, 2022.

Guy Tennenholtz and Shie Mannor. Uncertainty estimation using Riemannian model dynamics for offline reinforcement learning. In Advances in Neural Information Processing Systems.

Guy Tennenholtz and Shie Mannor. The natural language of actions. In International Conference on Machine Learning, pp. 6196-6205. PMLR, 2019.

Chen Tessler, Guy Tennenholtz, and Shie Mannor. Distributional policy optimization: An alternative approach for continuous control. Advances in Neural Information Processing Systems, 32, 2019.

Emanuel Todorov, Tom Erez, and Yuval Tassa. MuJoCo: A physics engine for model-based control. In 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems, pp. 5026-5033, 2012. doi: 10.1109/IROS.2012.6386109.

Rodrigo Toro Icarte, Ethan Waldie, Toryn Klassen, Rick Valenzano, Margarita Castro, and Sheila McIlraith. Learning reward machines for partially observable reinforcement learning. Advances in Neural Information Processing Systems, 32, 2019.

Daan Wierstra, Tom Schaul, Tobias Glasmachers, Yi Sun, Jan Peters, and Jürgen Schmidhuber. Natural evolution strategies. The Journal of Machine Learning Research, 15(1):949-980, 2014.

Pan Xu, Zheng Wen, Handong Zhao, and Quanquan Gu. Neural contextual bandits with deep representation and shallow exploration. arXiv preprint arXiv:2012.01780, 2020.

Kenny Young and Tian Tian. MinAtar: An Atari-inspired testbed for thorough and reproducible reinforcement learning experiments. arXiv preprint arXiv:1903.03176, 2019.

Dongruo Zhou, Lihong Li, and Quanquan Gu. Neural contextual bandits with UCB-based exploration. In International Conference on Machine Learning, pp. 11492-11502. PMLR, 2020.

## Appendix A. Algorithms

**Algorithm 4: Random Search / Evolution Strategy**
1. Input: initial policy $\pi_0 = \pi_{\theta_0}$, noise $\nu$, step size $\alpha$, set size $K$.
2. for $t = 1, 2, \ldots, T$ do
3. Sample a decision set $D_t = \{\theta_{t-1} \pm \delta_i\}_{i=1}^{K}$, $\delta_i \sim \mathcal{N}(0, \nu^2 I)$.
4. Collect the returns $\{G(\theta_{t-1} \pm \delta_i)\}_{i=1}^{K}$ of each policy in $D_t$.
5. Update the policy: $\theta_t = \theta_{t-1} + \frac{\alpha}{\sigma_R K} \sum_{i=1}^{K} \big[ G(\theta_{t-1} + \delta_i) - G(\theta_{t-1} - \delta_i) \big] \delta_i$.
6. end for
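The antithetic update on line 5 is the standard zero-order gradient estimate used by ES and random search (Salimans et al., 2017; Mania et al., 2018). A compact NumPy sketch is given below, with `rollout_return` standing in for environment evaluation; the function name and arguments are illustrative.

```python
import numpy as np

def es_update(theta, rollout_return, num_directions, nu, alpha, rng):
    """One antithetic ES / random-search step (Algorithm 4, line 5).

    rollout_return: callable mapping policy parameters -> sampled return G(theta).
    """
    deltas = rng.normal(0.0, nu, size=(num_directions, theta.shape[0]))
    g_plus = np.array([rollout_return(theta + d) for d in deltas])
    g_minus = np.array([rollout_return(theta - d) for d in deltas])
    sigma_r = np.concatenate([g_plus, g_minus]).std() + 1e-8   # return std, for scaling
    grad = (g_plus - g_minus) @ deltas / (sigma_r * num_directions)
    return theta + alpha * grad
```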
**Algorithm 5: Representation-Driven Evolution Strategy**
1. Input: initial policy $\pi_0 = \pi_{\theta_0}$, noise $\nu$, step size $\alpha$, evaluation set size $K$, decision set size $N$, history $H$.
2. for $t = 1, 2, \ldots, T$ do
3. Sample an evaluation set $\{\theta_{t-1} \pm \delta_i\}_{i=1}^{K}$, $\delta_i \sim \mathcal{N}(0, \nu^2 I)$.
4. Collect the returns $\{G(\theta_{t-1} \pm \delta_i)\}_{i=1}^{K}$ from the environment and store them in the replay buffer.
5. Update the representation $f_t$ and bandit parameters $(\hat{w}_t, V_t)$ using the history.
6. Construct a decision set $D_t = \{\theta_{t-1} \pm \delta_i\}_{i=1}^{N}$, $\delta_i \sim \mathcal{N}(0, \nu^2 I)$.
7. Use the linear bandit algorithm to evaluate each policy in $D_t$: $\{\hat{v}(\theta_{t-1} \pm \delta_i)\}_{i=1}^{N}$.
8. Update the policy: $g_t \propto \sum_{i=1}^{N} \big[ \hat{v}(\theta_{t-1} + \delta_i) - \hat{v}(\theta_{t-1} - \delta_i) \big] \delta_i$, $\quad \theta_t = \theta_{t-1} + \alpha g_t$.
9. end for

**Algorithm 6: Representation-Driven Policy Gradient**
1. Input: initial policy $\pi_\theta$, noise $\nu$, step size $\alpha$, decision set size $N$, regularization coefficient $\zeta$, history $H$.
2. for $t = 1, 2, \ldots, T$ do
3. for $1, 2, \ldots, K$ do
4. Collect trajectory data using $\pi_\theta$.
5. Update the representation $f$ and bandit parameters $(\hat{w}, \Sigma)$ using the history.
6. end for
7. for $1, 2, \ldots, M$ do
8. Sample a decision set $D = \{\theta + \delta_i\}_{i=1}^{N}$, $\delta_i \sim \mathcal{N}(0, \nu^2 I)$.
9. Use the linear bandit algorithm to choose the best parameters $\tilde{\theta} \in \arg\max_{\theta' \in D} \langle z, \hat{w} \rangle$ for $z \sim f(z|\theta')$.
10. Compute $g = \nabla_\theta \big( \mathcal{L}_{\mathrm{PG}}(\theta) + \zeta \|\theta - \tilde{\theta}\|^2 \big)$.
11. end for
12. end for

## Appendix B. Variational Interface

We present here the derivation of the ELBO loss for our variational interface, which was used to train the representation encoder:
$$
\begin{aligned}
\log p(G; \phi, \kappa) &= \log \int_z p_\kappa(G|z)\, p(z)\, dz
= \log \int_z p_\kappa(G|z)\, \frac{p(z)}{f_\phi(z|\pi)}\, f_\phi(z|\pi)\, dz \\
&= \log \mathbb{E}_{z \sim f_\phi(z|\pi)}\left[ p_\kappa(G|z)\, \frac{p(z)}{f_\phi(z|\pi)} \right]
\geq \mathbb{E}_{z \sim f_\phi(z|\pi)}\big[ \log p_\kappa(G|z) \big] + \mathbb{E}_{z \sim f_\phi(z|\pi)}\left[ \log \frac{p(z)}{f_\phi(z|\pi)} \right] \\
&= \mathbb{E}_{z \sim f_\phi(z|\pi)}\big[ \log p_\kappa(G|z) \big] - D_{\mathrm{KL}}\big(f_\phi(z|\pi) \,\|\, p(z)\big),
\end{aligned}
$$
where the inequality is due to Jensen's inequality.

## Appendix C. Proof of Proposition 3.2

By definition,
$$
\tilde{v}(\pi) = \sum_s \rho_\pi(s) v(\pi, s)
= \sum_{s,a} \rho_\pi(s, a) \Big[ r(s, a) + \gamma \sum_{s'} T(s'|s, a) v(\pi, s') \Big]
= v(\pi) + \gamma \sum_{s'} \rho_\pi(s') v(\pi, s')
= v(\pi) + \gamma \tilde{v}(\pi),
$$
where the second equality is due to the Bellman equation and the third is from the definition of $\rho_\pi$. Therefore,
$$\tilde{v}(\pi) = v(\pi) + \gamma \tilde{v}(\pi) \;\Longrightarrow\; \tilde{v}(\pi) = \frac{1}{1-\gamma}\, v(\pi),$$
i.e., $\tilde{v}(\pi)$ is proportional to $v(\pi)$.

## Appendix D. Full RepRL Scheme

The diagram referenced below illustrates the networks employed in RepRL. The policy's parameters are fed into the representation network, which serves as a posterior distribution capturing the latent representation of the policy. A sample is drawn from this representation posterior, and a linear return decoder, acting as the likelihood, predicts the return distribution with a linear mean (i.e., the policy's value). This framework is used to maximize the Evidence Lower Bound (ELBO).

Figure 7. The full diagram of the networks in RepRL.

## Appendix E. Decision Set Experiment

We assessed the impact of different decision sets on the performance of RepRL. We conducted tests using the three decision sets described in Section 3.3. The evaluation was carried out on a sparse HalfCheetah environment, using the RepES variant. When evaluating the history-based decision set, we considered a history window of 20 policies. In the latent-based setting, the parameters corresponding to the selected latent code were obtained using a gradient descent algorithm. The results show that RepRL exhibited similar performance across the various decision sets tested.

Figure 8. Experiments for three decision sets: policy space-based, latent space-based, and history-based, conducted on the sparse HalfCheetah environment.

## Appendix F. Hyperparameters and Network Architecture
### F.1. Grid-World

In the Grid World environment, an $8 \times 8$ grid is used with a horizon of 20, where the reward is determined by the stochastic function outlined in the paper: $r(x, y) \sim \mathcal{N}(\mu(x, y), \sigma^2)$, where $\sigma > 0$ and
$$\mu(x, y) = R_1 \exp\left\{-\tfrac{(x - x_1)^2 + (y - y_1)^2}{a_1}\right\} + R_2 \exp\left\{-\tfrac{(x - x_2)^2 + (y - y_2)^2}{a_2}\right\} + R_3 \mathbb{1}\{(x, y) = \text{goal}\}.$$
The parameters of the environment are set to $R_1 = 2.5$, $R_2 = 0.3$, $R_3 = 13$, $\sigma = 3$, $a_1 = 0.125$, $a_2 = 8$.

The policy is a fully-connected network with 3 layers using the tanh non-linearity. The hidden dimensions across the network are fixed at 32, followed by a softmax operation. The state is represented as a one-hot vector.

The representation encoder is built from Deep Weight-Space (DWS) layers (Navon et al., 2023), which are equivariant to the permutation symmetry of fully connected networks and provide a much stronger representation capacity for deep neural networks than standard architectures. The DWS model (DWSNet) comprises four layers with a hidden dimension of 16, with batch normalization between the layers, followed by a fully connected layer. Notably, the encoder is deterministic, i.e., it represents a delta function. For further details, we refer the reader to the code provided by Navon et al. (2023), which we used.

In the experimental phase, 300 rounds were executed, with 100 trajectories sampled at each round using noisy sampling of the current policy with zero-mean Gaussian noise and a standard deviation of 0.1. The ES algorithm used a step size of 0.1, while the RepRL algorithm employed a decision set of size 2048 with no discounting ($\gamma = 1$) and $\lambda = 0.1$.

### F.2. MuJoCo

In the MuJoCo experiments, both ES and RepES employed a linear policy, in accordance with the recommendations of Mania et al. (2018). For each environment, ES used the parameters specified by Mania et al. (2018), while RepES employed the same sampling strategy to ensure a fair comparison. RepES used a representation encoder consisting of a 4-layer fully-connected network with a dimension of 2048 across all layers and ReLU non-linearities, followed by a fully-connected layer for the mean and variance. The latent dimension was also chosen to be 2048. After each sampling round, the representation framework (encoder and decoder) was trained for 3 iterations on each example, using an Adam optimizer with a learning rate of $3 \times 10^{-4}$. When combining the learning signals of ES with RepES, a mixture-gradient approach was employed, with 20% of the gradient taken from the ES gradient and 80% from the RepES gradient. Across all experiments, a discount factor of $\gamma = 0.995$ and $\lambda = 0.1$ were used.

### F.3. MinAtar

In the MinAtar experiments, we employed a policy model consisting of a fully-connected neural network similar to the one used in the Grid World experiment, with a hidden dimension of 64. The value function had a similar structure, with a scalar output. The algorithms collected five rollout chunks of 512 steps between each training phase. The regularization coefficient chosen for RepRL was 1, while the discount factor and the mixing factor were set to $\gamma = 0.995$ and $\lambda = 0.1$.
The representation encoder was similar to the one employed in the Grid World experiments, with two layers, followed by a symmetry-invariant layer and two fully connected layers.