Extracting Reward Functions from Diffusion Models

Felipe Nuti  Tim Franzmeyer  João F. Henriques
{nuti, frtim, joao}@robots.ox.ac.uk
University of Oxford

Diffusion models have achieved remarkable results in image generation, and have similarly been used to learn high-performing policies in sequential decision-making tasks. Decision-making diffusion models can be trained on lower-quality data, and then be steered with a reward function to generate near-optimal trajectories. We consider the problem of extracting a reward function by comparing a decision-making diffusion model that models low-reward behavior and one that models high-reward behavior, a setting related to inverse reinforcement learning. We first define the notion of a relative reward function of two diffusion models and show conditions under which it exists and is unique. We then devise a practical learning algorithm for extracting it by aligning the gradients of a reward function parametrized by a neural network to the difference in outputs of both diffusion models. Our method finds correct reward functions in navigation environments, and we demonstrate that steering the base model with the learned reward functions results in significantly increased performance in standard locomotion benchmarks. Finally, we demonstrate that our approach generalizes beyond sequential decision-making by learning a reward-like function from two large-scale image generation diffusion models. The extracted reward function successfully assigns lower rewards to harmful images.1

1 Introduction

Recent work [25, 2] demonstrates that diffusion models, which display remarkable performance in image generation, are similarly applicable to sequential decision-making. Leveraging a well-established framing of reinforcement learning as conditional sampling [37], Janner et al. [25] show that diffusion models can be used to parameterize a reward-agnostic prior distribution over trajectories, learned from offline demonstrations alone. Using classifier guidance [61], the diffusion model can then be steered with a (cumulative) reward function, parametrized by a neural network, to generate near-optimal behaviors in various sequential decision-making tasks. Hence, it is possible to learn successful policies both by (a) training an expert diffusion model on a distribution of optimal trajectories, and by (b) training a base diffusion model on lower-quality or reward-agnostic trajectories and then steering it with the given reward function. This suggests that the reward function can be extracted by comparing the distributions produced by such base and expert diffusion models.

The problem of learning the preferences of an agent from observed behavior, often expressed as a reward function, is considered in the Inverse Reinforcement Learning (IRL) literature [56, 43]. In contrast to merely imitating the observed behavior, learning the reward function behind it allows for better robustness, better generalization to distinct environments, and for combining rewards extracted from multiple behaviors.

Equal contribution. 1 Video and code at https://www.robots.ox.ac.uk/~vgg/research/reward-diffusion/
37th Conference on Neural Information Processing Systems (NeurIPS 2023).
[Figure 1 overview: base and expert model denoising directions at diffusion steps t = 20, 5, and 0; relative reward function extraction with Loss(θ) = ||∇ relative reward_θ − (expert model − base model)||^2; heatmap of the learned relative reward; goal-directed vs. undirected trajectory distributions.]

Figure 1: We see 2D environments with black walls, in which an agent has to move through the maze to reach the goal in the top left corner (green box). The red shaded box shows the progression from an initial noised distribution over states (at diffusion timestep t = 20, left) to a denoised high-reward expert trajectory on the right. This distribution is modeled by an expert diffusion model. The blue shaded box depicts the same process but for a low-reward trajectory where the agent moves in the wrong direction. This distribution is modeled by a base diffusion model. Our method (green shaded box) trains a neural network to have its gradient aligned to the difference in outputs of these two diffusion models throughout the denoising process. As we argue in Section 4, this allows us to extract a relative reward function of the two models. We observe that the heatmap of the learned relative reward (bottom right) assigns high rewards to trajectories that reach the goal point (red star).

Prior work in reward learning is largely based on the Maximum Entropy Inverse Reinforcement Learning (MaxEnt IRL) framework [71], which relies on alternating between policy optimization and reward learning. This often comes with the assumption of access to the environment (or a simulator) to train the policy. Reward learning is also of independent interest outside of policy learning, for understanding agents' behavior or predicting their actions, with applications in Value Alignment, interpretability, and AI Safety [3, 14, 55]. For example, reward functions can also be utilized to better understand an existing AI system's explicit or implicit "preferences" and tendencies [57].

In this work, we introduce a method for extracting a relative reward function from two decision-making diffusion models, illustrated in Figure 1. Our method does not require environment access, simulators, or iterative policy optimization. Further, it is agnostic to the architecture of the diffusion models used, applying to continuous and discrete models, and making no assumption about whether the models are unguided, or whether they use classifier guidance [61] or classifier-free guidance [22]. We first derive a notion of a relative reward function of two diffusion models. We show that, under mild assumptions on the trajectory distribution and diffusion sampling process, our notion of reward exists and is unique, up to an additive constant. Further, we show that the derived reward is equivalent to the true reward under the probabilistic RL framework [37]. Finally, we propose a practical learning algorithm for extracting the relative reward function of two diffusion models by aligning the gradients of the learned reward function with the differences of the outputs of the base and expert models. As the extracted reward function itself is a feed-forward neural network, i.e. not a diffusion model, it is computationally lightweight.

Our proposed method for extracting a relative reward function of two diffusion models could be applicable to several scenarios.
For example, it allows for better interpretation of behavior differences, for composition and manipulation of reward functions, and for training agents from scratch or finetune existing policies. Further, relative reward functions could allow to better understand diffusion models by contrasting them. For example, the biases of large models trained on different datasets are not always obvious, and our method may aid interpretability and auditing of models by revealing the differences between the outputs they are producing. We empirically evaluate our reward learning method along three axes. In the Maze2D environments [12], we learn a reward function by comparing a base diffusion model trained on exploratory trajectories and an expert diffusion model trained on goal-directed trajectories, as illustrated in Figure 1. We can see in Figure 3 that our method learns the correct reward function for varying maze configurations. In the common locomotion environments Hopper, Half Cheetah, and Walker2D [12, 7], we learn a reward function by comparing a low-performance base model to an expert diffusion model and demonstrate that steering the base model with the learned reward function results in a significantly improved performance. Beyond sequential-decision making, we learn a reward-like function by comparing a base image generation diffusion model (Stable Diffusion, [54]) to a safer version of Stable Diffusion [59]. Figure 2 shows that the learned reward function penalizes images with harmful content, such as violence and hate, while rewarding harmless images. In summary, our work makes the following contributions: We introduce the concept of relative reward functions of diffusion models, and provide a mathematical analysis of their relation to rewards in sequential decision-making. We propose a practical learning algorithm for extracting relative reward functions by aligning the reward function s gradient with the difference in outputs of two diffusion models. We empirically validate our method in long-horizon planning environments, in high-dimensional control environments, and show generalization beyond sequential decision-making. 2 Related Work Diffusion models, originally proposed by Sohl-Dickstein et al. [61], are an expressive class of generative models that generate samples by learning to invert the process of noising the data. The work of Ho et al. [21] led to a resurgence of interest in the method, followed by [44, 10, 62]. Song et al. [65] proposed a unified treatment of diffusion models and score-based generative models [63, 64] through stochastic differential equations [45], used e.g. in [32]. Diffusion models have shown excellent performance in image generation [54, 10, 59, 52], molecule generation [23, 26, 27], 3D generation [48, 41], video generation [20], language [38], and, crucially for this work, sequential decision-making [25]. Part of their appeal as generative models is due to their steerability. Since sample generation is gradual, other pre-trained neural networks can be used to steer the diffusion model during sampling (i.e. classifier guidance, Section 3). An alternative approach is classifier-free guidance [22], often used in text-to-image models such as [54]. Decision-making with diffusion models have first been proposed by Janner et al. [25], who presented a hybrid solution for planning and dynamics modeling by iteratively denoising a sequence of states and actions at every time step, and executing the first denoised action. 
This method builds on the probabilistic inference framework for RL, reviewed by Levine [37], based on works including [66, 70, 71, 31, 30]. Diffusion models have been applied in the standard reinforcement learning setting [39, 2, 8], and also in 3D domains [24]. They have similarly been used in imitation learning for generating more human-like policies [47, 53], for traffic simulation [69], and for offline reinforcement learning [25, 68, 17]. Within sequential decision-making, much interest also lies in extracting reward functions from observed behavior. This problem of inverse RL [1, 5] has mostly been approached either by making strong assumptions about the structure of the reward function [71, 43, 15], or, more recently, by employing adversarial training objectives [13]. The optimization objective of our method resembles that of physics-informed neural networks [6, 16, 40, 50, 51, 11], which also align neural network gradients to a pre-specified function. In our case, this function is a difference in the outputs of two diffusion models. To the best of our knowledge, no previous reward learning methods are directly applicable to diffusion models, nor has extracting a relative reward function from two diffusion models been explored before.

3 Background

Our method leverages mathematical properties of diffusion-based planners to extract reward functions from them. To put our contributions into context and define notation, we give a brief overview of the probabilistic formulation of Reinforcement Learning presented in [37], then of diffusion models, and finally of how they come together in decision-making diffusion models [25].

Reinforcement Learning as Probabilistic Inference. Levine [37] provides an in-depth exposition of existing methods for approaching sequential decision-making from a probabilistic and causal angle using Probabilistic Graphical Models (PGM) [28]. We review some essential notions needed to understand this perspective on RL. Denote by ∆(X) the set of probability distributions over a set X.

Markov Decision Process (MDP). An MDP is a tuple (S, A, T, r, ρ_0, T) consisting of a state space S, an action space A, a transition function T : S × A → ∆(S), a reward function r : S × A → R_{≤0}, an initial state distribution ρ_0 ∈ ∆(S), and an episode length T. The MDP starts at an initial state s_0 ∼ ρ_0, and evolves by sampling a_t ∼ π(s_t), and then s_{t+1} ∼ T(s_t, a_t), for t ≥ 0. The reward received at time t is r_t = r(s_t, a_t). The episode ends at t = T. The sequence of state-action pairs τ = ((s_t, a_t))_{t=0}^{T} is called a trajectory.

The framework in [37] recasts such an MDP as a PGM, illustrated in Appendix B, Figure 5b. It consists of a sequence of states (s_t)_{t=0}^{T}, actions (a_t)_{t=0}^{T} and optimality variables (O_t)_{t=0}^{T}. The reward function r is not explicitly present. Instead, it is encoded in the optimality variables via the relation O_t ∼ Ber(e^{r(s_t, a_t)}). One can apply Bayes's rule to obtain

p(τ | O_{1:T}) ∝ p(τ) · p(O_{1:T} | τ)    (1)

which factorizes the distribution of optimal trajectories (up to a normalizing constant) into a prior p(τ) over trajectories and a likelihood term p(O_{1:T} | τ). From the definition of O_{1:T} and the PGM structure, we have p(O_{1:T} | τ) = e^{Σ_t r(s_t, a_t)}. Hence, the log-likelihood of optimality conditioned on a trajectory τ corresponds to its cumulative reward: log p(O_{1:T} | τ) = Σ_t r(s_t, a_t).
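As a concrete (and purely illustrative) reading of Eq. (1), the toy snippet below reweights trajectories drawn from a prior by the optimality likelihood exp(Σ_t r(s_t, a_t)), so that expectations under the reweighted samples approximate expectations under the posterior p(τ | O_{1:T}). The states and rewards are synthetic, and self-normalized importance weighting is used here only to illustrate the factorization, not as part of the paper's method.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy illustration of p(tau | O_{1:T}) ∝ p(tau) * exp(sum_t r(s_t, a_t)):
# sample trajectories from a prior and self-normalize weights given by the
# optimality likelihood. States and rewards here are synthetic.
n_traj, horizon = 1000, 10
trajectories = rng.normal(size=(n_traj, horizon))        # stand-in "states"
rewards = -np.abs(trajectories)                          # r(s_t, a_t) <= 0
log_lik = rewards.sum(axis=1)                            # log p(O_{1:T} | tau)

weights = np.exp(log_lik - log_lik.max())                # numerically stable
weights /= weights.sum()

# Mean of the first state under the prior vs. under the reweighted (posterior)
# samples: the posterior favors trajectories with higher return.
print(trajectories[:, 0].mean(), (weights * trajectories[:, 0]).sum())
```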
Diffusion Models in Continuous Time. At a high level, diffusion models work by adding noise to data x ∈ R^n (forward process), and then learning to denoise it (backward process). The forward noising process in continuous time follows the Stochastic Differential Equation (SDE):

dx_t = f(x_t, t) dt + g(t) dw_t    (2)

where f is a Lipschitz function, w is a standard Brownian motion [36], and g is a (continuous) noise schedule, which regulates the amount of noise added to the data during the forward process (cf. A.2). Song et al. [65] then use a result of Anderson [4] to write the SDE satisfied by the reverse process of (2), denoted x̄_t, as:

dx̄_t = [f(x̄_t, t) − g(t)^2 ∇_x log p_t(x̄_t)] dt + g(t) dw̄_t    (3)

Here p_t(x) denotes the marginal density function of the forward process x_t, and w̄ is a reverse Brownian motion (see [4]). The diffusion model is a neural network s_Θ(x_t, t) with parameters Θ that is trained to approximate ∇_x log p_t(x_t), called the score of the distribution of x_t. The network s_Θ(x_t, t) can then be used to generate new samples from p_0 by taking x̄_T ∼ N(0, I) for some T > 0, and simulating (3) backwards in time to arrive at x̄_0 ≈ x_0, with the neural network s_Θ in place of the score term:

dx̄_t = [f(x̄_t, t) − g(t)^2 s_Θ(x̄_t, t)] dt + g(t) dw̄_t    (4)

This formulation is essential for deriving existence results for the relative reward function of two diffusion models in Section 4, as it allows for conditional sampling. For example, to sample from p(x_0 | y) ∝ p(x_0) · p(y | x_0), where y ∈ {0, 1}, the sampling procedure can be modified to use ∇_x log p(x_t | y) ≈ s_Θ(x_t, t) + ∇_x ρ(x_t, t) instead of s_Θ(x̄_t, t). Here, ρ(x, t) is a neural network approximating log p(y | x_t). The gradients of ρ are multiplied by a small constant ω, called the guidance scale. The resulting guided reverse SDE is as follows:

dx̄_t = [f(x̄_t, t) − g(t)^2 (s_Θ(x̄_t, t) + ω ∇_x ρ(x̄_t, t))] dt + g(t) dw̄_t    (5)

Informally, this method, often called classifier guidance [61], allows for steering a diffusion model to produce samples x with some property y by gradually pushing the samples in the direction that maximizes the output of a classifier predicting p(y | x).

Planning with Diffusion. The above shows how sequential decision-making can be framed as sampling from a posterior distribution p(τ | O_{1:T}) over trajectories. Section 3 shows how a diffusion model p(x_0) can be combined with a classifier to sample from a posterior p(x_0 | y). These two observations point us to the approach in Diffuser [25]: using a diffusion model to model a prior p(τ) over trajectories, and a reward prediction function ρ(x, t) ≈ p(O_{1:T} | τ_t) to steer the diffusion model. This allows approximate sampling from p(τ | O_{1:T}), which produces (near-)optimal trajectories. The policy generated by Diffuser denoises a fixed-length sequence of future states and actions and executes the first action of the sequence. Diffuser can also be employed for goal-directed planning, by fixing initial and goal states during the denoising process.

In the previous section, we established the connection between the (cumulative) return Σ_t r(s_t, a_t) of a trajectory τ = ((s_t, a_t))_{t=1}^{T} and the value p(y | x) used for classifier guidance (Section 3), with x corresponding to τ and y corresponding to O_{1:T}. Hence, we can look at our goal as finding p(y | x), or equivalently p(O_{1:T} | τ), in order to recover the cumulative reward for a given trajectory τ. Additionally, in Section 4.3, we will demonstrate how the single-step reward can be computed by choosing a specific parametrization for p(y | x).
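To make the sampling procedure concrete, the sketch below integrates the guided reverse SDE (5) with a plain Euler–Maruyama loop. It is illustrative only, not the implementation used in the paper: score_net, reward_net, the drift f, and the noise schedule g are assumed callables, and a practical sampler would use the discretization matching how the diffusion model was trained.

```python
import torch

def sample_guided_reverse_sde(score_net, reward_net, f, g, x_T,
                              n_steps=1000, T=1.0, omega=0.1):
    # Euler-Maruyama integration of the guided reverse SDE in Eq. (5),
    # running from t = T down to t = 0.
    dt = T / n_steps
    x = x_T
    for i in range(n_steps):
        t_val = T - i * dt
        t = torch.full((x.shape[0],), t_val, device=x.device)

        # Classifier-guidance term: gradient of rho w.r.t. the noisy sample.
        x = x.detach().requires_grad_(True)
        grad_rho = torch.autograd.grad(reward_net(x, t).sum(), x)[0]

        with torch.no_grad():
            drift = f(x, t) - g(t_val) ** 2 * (score_net(x, t) + omega * grad_rho)
            x = x - drift * dt + g(t_val) * (dt ** 0.5) * torch.randn_like(x)
    return x.detach()
```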
Problem Setting. We consider a scenario where we have two decision-making diffusion models: a base model s^(1)_ϕ that generates reward-agnostic trajectories, and an expert model s^(2)_Θ that generates trajectories optimal under some unknown reward function r. Our objective is to learn a reward function ρ(x, t) such that, if ρ is used to steer the base model s^(1)_ϕ through classifier guidance, we obtain a distribution close to that of the expert model s^(2)_Θ. From the above discussion, such a function ρ would correspond to the notion of relative reward in the probabilistic RL setting. In the following, we present theory showing that:

1. In an idealized setting where s^(1) and s^(2) have no approximation error (and are thus conservative vector fields), there exists a unique function ρ that exactly converts s^(1) to s^(2) through classifier guidance.
2. In practice, we cannot expect such a classifier to exist, as approximation errors might result in diffusion models corresponding to non-conservative vector fields.
3. However, the functions ρ that best approximate the desired property (to arbitrary precision ε) do exist. These are given by Def. 4.4 and can be obtained through a projection using an L^2 distance.
4. The use of an L^2 distance naturally results in an L^2 loss for learning ρ with gradient descent.

4.1 A Result on Existence and Uniqueness

We now provide a result saying that, once s^(1)_ϕ and s^(2)_Θ are fixed, there is a condition on the gradients of ρ that, if met, would allow us to match not only the distributions of s^(1)_ϕ and s^(2)_Θ, but also their entire denoising processes, with probability 1 (i.e. almost surely, a.s.). In the following theorem, h plays the role of the gradients ∇_x ρ(x_t, t). Going from time t = 0 to t = T in the theorem corresponds to solving the backward SDE (3) from t = T to t = 0, and f in the theorem corresponds to the drift term (i.e. the coefficient of dt) of (3). For the proof, see Appendix C.

Theorem 4.1 (Existence and Uniqueness). Let T > 0. Let f^(1) and f^(2) be functions from R^n × [0, T] to R^n that are Lipschitz, and let g : [0, T] → R_{≥0} be bounded and continuous with g(0) > 0. Fix a probability space (Ω, F, P), and a standard R^n-Brownian motion (w_t)_{t≥0}. Consider the Itô SDEs:

dx^(1)_t = f^(1)(x^(1)_t, t) dt + g(t) dw_t    (6)
dx^(2)_t = f^(2)(x^(2)_t, t) dt + g(t) dw_t    (7)
dx_t = [f^(1)(x_t, t) + h(x_t, t)] dt + g(t) dw_t,  where h is Lipschitz    (8)

and fix an initial condition x^(1)_0 = x^(2)_0 = x_0 = z, where z is a random variable with E[||z||_2^2] < ∞. Then (6), (7), and (8) have almost surely (a.s.) unique solutions x^(1), x^(2) and x with a.s. continuous sample paths. Furthermore, there exists an a.s. unique choice of h such that x_t = x^(2)_t for all t ≥ 0, a.s., which is given by

h(x, t) = f^(2)(x, t) − f^(1)(x, t).    (9)

In all diffusion model methods we are aware of, the drift and noise terms of the backward process indeed satisfy the pre-conditions of the theorem, under reasonable assumptions on the data distribution (see C.3), and using a network with smooth activation functions like Mish [42] or GELU [19]. Therefore, Theorem 4.1 tells us that, if we were free to pick the gradients ∇_x ρ(x_t, t), setting them to f^(2)(x_t, t) − f^(1)(x_t, t) would be the best choice: it is the only choice resulting in guided samples exactly reproducing the whole process x^(2) (and, in particular, the distribution of x^(2)_0). We will now see that, in an idealized setting, there exists a unique function ρ satisfying this criterion.
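The toy simulation below (our own illustration, using arbitrary Lipschitz drifts rather than trained diffusion models) makes the conclusion of Theorem 4.1 tangible: driving Eq. (8) with h = f^(2) − f^(1) and the same Brownian increments as Eq. (7) reproduces the second process path by path.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two scalar SDEs sharing the same Brownian increments: dx = f_i(x, t) dt + g(t) dw.
f1 = lambda x, t: -x                      # base drift (Lipschitz)
f2 = lambda x, t: -x + np.sin(x) + t      # "expert" drift (Lipschitz)
g = lambda t: 0.5                         # bounded, continuous noise schedule
h = lambda x, t: f2(x, t) - f1(x, t)      # the unique choice from Eq. (9)

T, n_steps = 1.0, 1000
dt = T / n_steps
x_expert = x_guided = 0.3                 # shared initial condition z
for i in range(n_steps):
    t = i * dt
    dw = rng.normal(scale=np.sqrt(dt))    # shared Brownian increment
    x_expert = x_expert + f2(x_expert, t) * dt + g(t) * dw
    x_guided = x_guided + (f1(x_guided, t) + h(x_guided, t)) * dt + g(t) * dw

print(abs(x_expert - x_guided))           # ~0 (agrees up to floating-point error)
```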
We start by recalling the concept of a conservative vector field from multivariate calculus, and how it relates to gradients of continuously differentiable functions.

Definition 4.2 (Conservative Vector Field, Definition 7.6 in [35]). We say that a vector field f is conservative if it is the gradient of a continuously differentiable function Φ, i.e. f(x) = ∇_x Φ(x) for all x in the domain of f. The function Φ is called a potential function of f.

Suppose we had access to the ground-truth scores s^(1)_true(x, t) and s^(2)_true(x, t) of the forward processes for the base and expert models (i.e. no approximation error). Then they are equal to ∇_x log p^(1)_t(x) and ∇_x log p^(2)_t(x), respectively. If we also assume p^(1)_t(x) and p^(2)_t(x) are continuously differentiable, we have that, by Definition 4.2, the diffusion models are conservative for each t. Thus, their difference is also conservative, i.e. the gradient of a continuously differentiable function. Hence, by the Fundamental Theorem for Line Integrals (Th. 7.2 in [35]), there exists a unique ρ, up to an additive constant, satisfying ∇_x ρ(x, t) = s^(2)_true(x, t) − s^(1)_true(x, t), given by the line integral

ρ(x, t) = ∫_{x_ref}^{x} [s^(2)_true(x', t) − s^(1)_true(x', t)] · dx'    (10)

where x_ref is some arbitrary reference point, and the line integral is path-independent. In practice, however, we cannot guarantee the absence of approximation errors, nor that the diffusion models are conservative.

4.2 Relative Reward Function of Two Diffusion Models

To get around the possibility that s^(1) and s^(2) are not conservative, we may instead look for the conservative field best approximating s^(2)(x, t) − s^(1)(x, t) in L^2(R^n, R^n) (i.e. the space of square-integrable vector fields, endowed with the L^2 norm). Using a well-known fundamental result on the uniqueness of projections in L^2 (Th. C.8), we obtain the following:

Proposition 4.3 (Optimal Relative Reward Gradient). Let s^(1)_ϕ and s^(2)_Θ be any two diffusion models and t ∈ (0, T], with the assumption that s^(2)_Θ(·, t) − s^(1)_ϕ(·, t) is square-integrable. Then there exists a unique vector field h_t given by

h_t = argmin_{f ∈ Cons(R^n)} ∫_{R^n} ||f(x) − (s^(2)_Θ(x, t) − s^(1)_ϕ(x, t))||_2^2 dx    (11)

where Cons(R^n) denotes the closed span of gradients of smooth W^{1,2} potentials. Furthermore, for any ε > 0, there is a smooth, square-integrable potential Φ with a square-integrable gradient satisfying:

∫_{R^n} ||∇_x Φ(x) − h_t(x)||_2^2 dx < ε    (12)

We call such an h_t the optimal relative reward gradient of s^(1)_ϕ and s^(2)_Θ at time t.

For the proof of Proposition 4.3, see Appendix C.4. It is important to note that for the projection to be well-defined, we required an assumption regarding the integrability of the diffusion models. Without this assumption, the integral would simply diverge. The result in Proposition 4.3 tells us that we can get arbitrarily close to the optimal relative reward gradient using gradients of scalar potentials. Therefore, we may finally define the central notion of this paper:
Definition 4.4 (ε-Relative Reward Function). For an ε > 0, an ε-relative reward function of diffusion models s^(1)_ϕ and s^(2)_Θ is a function ρ : R^n × [0, T] → R such that, for all t ∈ (0, T],

∫_{R^n} ||∇_x ρ(x, t) − h_t(x)||_2^2 dx < ε    (13)

where h_t denotes the optimal relative reward gradient of s^(1)_ϕ and s^(2)_Θ at time t.

4.3 Extracting Reward Functions

We now set out to actually approximate the relative reward function ρ. Definition 4.4 naturally translates into an L^2 training objective for learning ρ:

L_RRF(θ) = E_{t ∼ U[0, T], x_t ∼ p_t} [ ||∇_x ρ_θ(x_t, t) − (s^(2)_Θ(x_t, t) − s^(1)_ϕ(x_t, t))||_2^2 ]    (14)

where p_t denotes the marginal at time t of the forward noising process. We optimize this objective via Empirical Risk Minimization and Stochastic Gradient Descent. See Algorithm 1 for a version assuming access to the diffusion models and their training datasets, and Algorithm 2 in Appendix D for one which does not assume access to any pre-existing dataset. Our method requires no access to the environment or to a simulator. Our algorithm requires computing a second-order mixed derivative D_θ(∇_x ρ(x, t)) ∈ R^{m×n} (where m is the number of parameters θ), for which we use automatic differentiation in PyTorch [46].

Algorithm 1: Relative reward function training.
Input: Base s^(1) and expert s^(2) diffusion models, dataset D, number of iterations I.
Output: Relative reward estimator ρ_θ.
Initialize reward estimator parameters θ.
for j ∈ {1, ..., I} do
    Sample batch X_0 = [x^(1)_0, ..., x^(N)_0] from D
    Sample times t = [t_1, ..., t_N] independently from U(0, T]
    Sample forward process X_t ← [x^(1)_{t_1}, ..., x^(N)_{t_N}]
    Take an optimization step on θ according to
        L̂_RRF(θ) = (1/N) Σ_{i=1}^{N} ||∇_x ρ_θ(x^(i)_{t_i}, t_i) − (s^(2)_Θ(x^(i)_{t_i}, t_i) − s^(1)_ϕ(x^(i)_{t_i}, t_i))||_2^2
end

Figure 2: Learned rewards for base and expert diffusion models from Stable Diffusion (Sec. 5). Prompts from the I2P dataset. [Histogram of normalized counts over estimated reward for the base (unsafe) and expert (safe) models.]

Recovering Per-Time-Step Rewards. We can parameterize ρ using a single-time-step neural network g_θ(s, a, t) as ρ(τ_t, t) = (1/N) Σ_{i=1}^{N} g_θ(s^i_t, a^i_t, t), where N is the horizon, τ_t denotes a trajectory at diffusion timestep t, and s^i_t and a^i_t denote the i-th state and action in τ_t. Then, g_θ(s, a, 0) predicts a reward for a state-action pair (s, a).
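Below is a minimal PyTorch sketch of one optimization step of Algorithm 1 (the empirical version of Eq. (14)). The interfaces are assumptions rather than the released code: reward_net(x, t) returns one scalar per sample, base_model and expert_model return outputs shaped like x, and the inlined forward-noising step is a generic variance-preserving placeholder that must be replaced by the noising process the two diffusion models were actually trained with.

```python
import torch

def rrf_training_step(reward_net, base_model, expert_model, x0, T, optimizer):
    # One step of Algorithm 1: align grad_x of the learned reward with the
    # difference of the two diffusion models' outputs (Eq. 14).
    N = x0.shape[0]
    t = torch.rand(N, device=x0.device) * T                  # t_i ~ U(0, T]
    shape = (-1,) + (1,) * (x0.dim() - 1)

    # Placeholder forward-noising step (variance-preserving style); replace
    # with the schedule the diffusion models were trained with.
    alpha = torch.exp(-0.5 * t).view(shape)
    x_t = alpha * x0 + torch.sqrt(1.0 - alpha ** 2) * torch.randn_like(x0)
    x_t = x_t.detach().requires_grad_(True)

    # grad_x rho_theta(x_t, t), keeping the graph so the loss can be
    # backpropagated to theta (mixed second derivative D_theta(grad_x rho)).
    grad_rho = torch.autograd.grad(reward_net(x_t, t).sum(), x_t,
                                   create_graph=True)[0]

    with torch.no_grad():
        target = expert_model(x_t, t) - base_model(x_t, t)   # s^(2) - s^(1)

    loss = ((grad_rho - target) ** 2).flatten(1).sum(dim=1).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```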
5 Experiments

In this section, we conduct empirical investigations to analyze the properties of relative reward functions in practice. Our experiments focus on three main aspects: the alignment of learned reward functions with the goals of the expert agent, the performance improvement achieved through steering the base model using the learned reward, and the generalizability of learning reward-like functions to domains beyond decision-making (note that the relative reward function is defined for any pair of diffusion models). For details on implementation specifics, computational requirements, and experiment replication instructions, we refer readers to the Appendix.

5.1 Learning Correct Reward Functions from Long-Horizon Plans

Maze2D [12] features various environments which involve controlling the acceleration of a ball to navigate it towards various goal positions in 2D mazes. It is suitable for evaluating reward learning, as it requires effective credit assignment over extended trajectories.

Figure 3: Heatmaps of the learned relative reward in Maze2D. The marker denotes the ground-truth goal position. For more examples, see Appendix E.3.

Implementation. To conduct our experiments, we generate multiple datasets of trajectories in each of the 2D environments, following the data generation procedure from D4RL [12], except sampling start and goal positions uniformly at random (as opposed to only at integer coordinates). For four maze environments with different wall configurations (depicted in Figure 3), we first train a base diffusion model on a dataset of uniformly sampled start and goal positions, hence representing undirected behavior. For each environment, we then train eight expert diffusion models on datasets with fixed goal positions. As there are four maze configurations and eight goal positions per maze configuration, we are left with 32 expert models in total. For each of the 32 expert models, we train a relative reward estimator ρ_θ, implemented as an MLP, via gradient alignment, as described in Alg. 1. We repeat this across 5 random seeds.

Discriminability Results. For a quantitative evaluation, we train a logistic regression classifier per expert model on a balanced dataset of base and expert trajectories. Its objective is to label trajectories as base or expert using only the predicted reward as input. We repeat this process for each goal position with 5 different seeds, and average accuracies across seeds. The achieved accuracies range from 65.33% to 97.26%, with a median of 84.49% and a mean of 83.76%. These results demonstrate that the learned reward effectively discriminates expert trajectories.

Visualization of Learned Reward Functions. To visualize the rewards, we use our model to predict rewards for fixed-length sub-trajectories in the base dataset. We then generate a 2D heatmap by averaging the rewards of all sub-trajectories that pass through each grid cell. Fig. 3 displays some of these heatmaps, with more examples given in App. F.

Results. We observe in Fig. 3 that the network accurately captures the rewards, with peaks occurring at the true goal position of the expert dataset in 78.75% ± 8.96% of the cases for simpler mazes (maze2d-open-v0 and maze2d-umaze-v0), and in 77.50% ± 9.15% for more advanced mazes (maze2d-medium-v1 and maze2d-large-v1). Overall, the network achieves an average success rate of 78.12% ± 6.40%. We further verify the correctness of the learned reward functions by retraining agents in Maze2D using the extracted reward functions (see Appendix E.2.1).

Sensitivity to Dataset Size. Since training diffusion models typically demands a substantial amount of data, we investigate the sensitivity of our method to dataset size. Remarkably, we observe that performance degrades only slightly even with significantly smaller dataset sizes. For complete results, please refer to Appendix E.3.
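The heatmaps in Figure 3 can be produced with a simple cell-averaging procedure like the sketch below (our own illustration; the array shapes, square grid, and coordinate range are assumptions, not the exact implementation).

```python
import numpy as np

def reward_heatmap(subtrajectories, rewards, grid_size=20, extent=(0.0, 1.0)):
    # subtrajectories: (M, L, 2) xy positions; rewards: (M,) predicted rewards.
    lo, hi = extent
    total = np.zeros((grid_size, grid_size))
    count = np.zeros((grid_size, grid_size))
    for traj, r in zip(subtrajectories, rewards):
        cells = np.floor((traj - lo) / (hi - lo) * grid_size).astype(int)
        cells = np.clip(cells, 0, grid_size - 1)
        for i, j in set(map(tuple, cells)):   # each visited cell counted once
            total[i, j] += r
            count[i, j] += 1
    return total / np.maximum(count, 1)       # average reward per visited cell

# Example with random data standing in for predicted sub-trajectory rewards:
# heatmap = reward_heatmap(np.random.rand(100, 32, 2), np.random.randn(100))
```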
5.2 Steering Diffusion Models to Improve their Performance

Having established the effectiveness of relative reward functions in recovering expert goals in the low-dimensional Maze2D environment, we now examine their applicability in higher-dimensional control tasks. Specifically, we evaluate their performance in the HalfCheetah, Hopper, and Walker2D environments from the D4RL offline locomotion suite [12]. These tasks involve controlling a multi-degree-of-freedom 2D robot to move forward at the highest possible speed. We assess the learned reward functions by examining whether they can enhance the performance of a weak base model when used for classifier-guided steering. If successful, this would provide evidence that the learned relative rewards can indeed bring the base trajectory distribution closer to the expert distribution.

Implementation. In these locomotion environments, the notion of reward is primarily focused on moving forward. Therefore, instead of selecting expert behaviors from a diverse set of base behaviors, our reward function aims to guide a below-average-performing base model toward improved performance. Specifically, we train the base models using the medium-replay datasets from D4RL, which yield low rewards, and the expert models using the expert datasets, which yield high rewards. The reward functions are then fitted using gradient alignment as described in Algorithm 1. Finally, we employ classifier guidance (Section 3) to steer each base model using the corresponding learned reward function.

Figure 4: The 3 images with the highest learned reward in their batch ("safe"), and the 3 with the lowest reward ("unsafe", blurred), respectively.

Table 1: Diffuser performance with different steering methods, on 3 RL environments and for a low-performance base model (top) and a medium-performance model (bottom).

Environment | Unsteered | Discriminator | Reward (Ours)
HalfCheetah | 30.38 ± 0.38 | 30.41 ± 0.38 | 31.65 ± 0.32
Hopper | 24.67 ± 0.92 | 25.12 ± 0.95 | 27.04 ± 0.90
Walker2d | 28.20 ± 0.99 | 27.98 ± 0.99 | 38.40 ± 1.02
Mean | 27.75 | 27.84 | 32.36
HalfCheetah | 59.41 ± 0.87 | 70.79 ± 1.92 | 69.32 ± 0.80
Hopper | 58.80 ± 1.01 | 59.42 ± 2.23 | 64.97 ± 1.15
Walker2d | 96.12 ± 0.92 | 96.75 ± 1.91 | 102.05 ± 1.15
Mean | 71.44 | 75.65 | 78.78

Results. We conduct 512 independent rollouts of the base diffusion model steered by the learned reward, using various guidance scales ω (Eq. 5). We use the unsteered base model as a baseline. We also compare our approach to a discriminator with the same architecture as our reward function, trained to predict whether a trajectory originates from the base or the expert dataset. We train our models with 5 random seeds, and run the 512 independent rollouts for each seed. Steering the base models with our learned reward functions consistently leads to statistically significant performance improvements across all three environments. Notably, the Walker2D task demonstrates a 36.17% relative improvement compared to the unsteered model. This outcome suggests that the reward functions effectively capture the distinctions between the two diffusion models. See Table 1 (top three rows). We further conducted additional experiments in the locomotion domains in which we steer a new medium-performance diffusion model, unseen during training, using the relative reward function learned from the base diffusion model and the expert diffusion model. We observe in Table 1 (bottom three rows) that our learned reward function significantly improves performance also in this scenario.

5.3 Learning a Reward-Like Function for Stable Diffusion

While reward functions are primarily used in sequential decision-making problems, we propose a generalization to broader domains through the concept of relative reward functions, as discussed in Section 4. To empirically evaluate this generalization, we focus on one of the domains where diffusion models have demonstrated exceptional performance: image generation. Specifically, we examine Stable Diffusion [54], a widely used 859M-parameter diffusion model for general-purpose image generation, and Safe Stable Diffusion [59], a modified version of Stable Diffusion designed to mitigate the generation of inappropriate images.

Models. The models under consideration are latent diffusion models, where the denoising process occurs in a latent space and is subsequently decoded into an image. These models employ classifier-free guidance [22] during sampling and can be steered using natural language prompts, utilizing CLIP embeddings [49] in the latent space.
Specifically, Safe Stable Diffusion [59] introduces modifications to the sampling loop of the open-source model proposed by Rombach et al. [54], without altering the model s actual weights. In contrast to traditional classifier-free guidance that steers samples toward a given prompt, the modified sampler of Safe Stable Diffusion also directs samples away from undesirable prompts [59, Sec. 3]. Prompt Dataset. To investigate whether our reward networks can detect a "relative preference" of Safe Stable Diffusion over the base Stable Diffusion model for harmless images, we use the I2P prompt dataset introduced by Schramowski et al. [59]. This dataset consists of prompts specifically designed to deceive Stable Diffusion into generating imagery with unsafe content. However, we use the dataset to generate sets of image embeddings rather than actual images, which serve as training data for our reward networks. A portion of the generated dataset containing an equal number of base and expert samples is set aside for model evaluation. Separating Image Distributions. Despite the complex and multimodal nature of the data distribution in this context, we observe that our reward networks are capable of distinguishing between base and expert images with over 90% accuracy, despite not being explicitly trained for this task. The reward histogram is visualized in Figure 2. Qualitative Evaluation. We find that images that receive high rewards correspond to safe content, while those with low rewards typically contain unsafe or disturbing material, including hateful or offensive imagery. To illustrate this, we sample batches from the validation set, compute the rewards for each image, and decode the image with the highest reward and the one with the lowest reward from each batch. Considering the sensitive nature of the generated images, we blur the latter set as an additional safety precaution. Example images can be observed in Figure 4. 6 Conclusion To the best of our knowledge, our work introduces the first method for extracting relative reward functions from two diffusion models. We provide theoretical justification for our approach and demonstrate its effectiveness in diverse domains and settings. We expect that our method has the potential to facilitate the learning of reward functions from large pre-trained models, improving our understanding and the alignment of the generated outputs. It is important to note that our experiments primarily rely on simulated environments, and further research is required to demonstrate its applicability in real-world scenarios. 7 Acknowledgements This work was supported by the Royal Academy of Engineering (RF\201819\18\163). F.N. receives a scholarship from Fundação Estudar. F.N. also used TPUs granted by the Google TPU Research Cloud (TRC) in the initial exploratory stages of the project. We would like to thank Luke Melas-Kyriazi, Prof. Varun Kanade, Yizhang Lou, Ruining Li and Eduard Oravkin for proofreading versions of this work. F.N. would also like to thank Prof. Stefan Kiefer for the support during the project, and Michael Janner for the productive conversations in Berkeley about planning with Diffusion Models. We use the following technologies and repositories as components in our code: Py Torch [46], Num Py [18], Diffuser [25], D4RL [12], Hugging Face Diffusers [67] and Lucid Rains s Diffusion Models in Pytorch repository. [1] Abbeel, P. and Ng, A. Y. 
[2004], Apprenticeship learning via inverse reinforcement learning, in Proceedings, Twenty-First International Conference on Machine Learning, ICML 2004 . [2] Ajay, A., Du, Y., Gupta, A., Tenenbaum, J., Jaakkola, T. and Agrawal, P. [2022], Is conditional generative modeling all you need for decision-making? , ar Xiv preprint ar Xiv:2211.15657 . [3] Amodei, D., Olah, C., Steinhardt, J., Christiano, P., Schulman, J. and Mané, D. [2016], Concrete problems in ai safety . [4] Anderson, B. D. [1982], Reverse-time diffusion equation models , Stochastic Processes and their Applications 12(3), 313 326. URL: https://www.sciencedirect.com/science/article/pii/0304414982900515 [5] Arora, S. and Doshi, P. [2021], A survey of inverse reinforcement learning: Challenges, methods and progress , Artificial Intelligence 297, 103500. [6] Artificial neural networks for solving ordinary and partial differential equations [1998], IEEE transactions on neural networks 9(5), 987 1000. [7] Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J. and Zaremba, W. [2016], Openai gym , ar Xiv preprint ar Xiv:1606.01540 . [8] Chi, C., Feng, S., Du, Y., Xu, Z., Cousineau, E., Burchfiel, B. and Song, S. [2023], Diffusion policy: Visuomotor policy learning via action diffusion , ar Xiv preprint ar Xiv:2303.04137 . [9] De Bortoli, V. [2022], Convergence of denoising diffusion models under the manifold hypothesis , ar Xiv preprint ar Xiv:2208.05314 . [10] Dhariwal, P. and Nichol, A. [2021], Diffusion models beat gans on image synthesis , Advances in Neural Information Processing Systems 34, 8780 8794. [11] Fang, J., Liu, C., Simos, T. and Famelis, I. T. [2020], Neural network solution of single-delay differential equations , Mediterranean Journal of Mathematics 17, 1 15. [12] Fu, J., Kumar, A., Nachum, O., Tucker, G. and Levine, S. [2020], D4rl: Datasets for deep data-driven reinforcement learning , ar Xiv preprint ar Xiv:2004.07219 . [13] Fu, J., Luo, K. and Levine, S. [2017], Learning robust rewards with adversarial inverse reinforcement learning , ar Xiv preprint ar Xiv:1710.11248 . [14] Gabriel, I. [2020], Artificial Intelligence, Values, and Alignment , Minds and Machines 30(3), 411 437. URL: https://doi.org/10.1007/s11023-020-09539-2 [15] Hadfield-Menell, D., Russell, S. J., Abbeel, P. and Dragan, A. [2016], Cooperative inverse reinforcement learning, in Advances in neural information processing systems , pp. 3909 3917. [16] Han, J., Jentzen, A. and E, W. [2018], Solving high-dimensional partial differential equations using deep learning , Proceedings of the National Academy of Sciences 115(34), 8505 8510. [17] Hansen-Estruch, P., Kostrikov, I., Janner, M., Kuba, J. G. and Levine, S. [2023], Idql: Implicit q-learning as an actor-critic method with diffusion policies , ar Xiv preprint ar Xiv:2304.10573 . [18] Harris, C. R., Millman, K. J., van der Walt, S. J., Gommers, R., Virtanen, P., Cournapeau, D., Wieser, E., Taylor, J., Berg, S., Smith, N. J., Kern, R., Picus, M., Hoyer, S., van Kerkwijk, M. H., Brett, M., Haldane, A., del Río, J. F., Wiebe, M., Peterson, P., Gérard-Marchant, P., Sheppard, K., Reddy, T., Weckesser, W., Abbasi, H., Gohlke, C. and Oliphant, T. E. [2020], Array programming with Num Py , Nature 585(7825), 357 362. URL: https://doi.org/10.1038/s41586-020-2649-2 [19] Hendrycks, D. and Gimpel, K. [2016], Gaussian error linear units (gelus) , ar Xiv preprint ar Xiv:1606.08415 . [20] Ho, J., Chan, W., Saharia, C., Whang, J., Gao, R., Gritsenko, A., Kingma, D. 
P., Poole, B., Norouzi, M., Fleet, D. J. et al. [n.d.], Imagen video: High definition video generation with diffusion models . [21] Ho, J., Jain, A. and Abbeel, P. [2020], Denoising diffusion probabilistic models , Advances in Neural Information Processing Systems 33, 6840 6851. [22] Ho, J. and Salimans, T. [n.d.], Classifier-free diffusion guidance, in Neur IPS 2021 Workshop on Deep Generative Models and Downstream Applications . [23] Hoogeboom, E., Satorras, V. G., Vignac, C. and Welling, M. [2022], Equivariant diffusion for molecule generation in 3d, in International Conference on Machine Learning , PMLR, pp. 8867 8887. [24] Huang, S., Wang, Z., Li, P., Jia, B., Liu, T., Zhu, Y., Liang, W. and Zhu, S.-C. [2023], Diffusion-based generation, optimization, and planning in 3d scenes , ar Xiv preprint ar Xiv:2301.06015 . [25] Janner, M., Du, Y., Tenenbaum, J. and Levine, S. [2022], Planning with diffusion for flexible behavior synthesis . [26] Jing, B., Corso, G., Chang, J., Barzilay, R. and Jaakkola, T. S. [n.d.], Torsional diffusion for molecular conformer generation, in Advances in Neural Information Processing Systems . [27] Jing, B., Erives, E., Pao-Huang, P., Corso, G., Berger, B. and Jaakkola, T. S. [n.d.], Eigenfold: Generative protein structure prediction with diffusion models, in ICLR 2023-Machine Learning for Drug Discovery workshop . [28] Jordan, M. I. [2004], Graphical models , Statistical Science 19(1), 140 155. URL: http://www.jstor.org/stable/4144379 [29] Jost, J. [2012], Partial differential equations, Vol. 214, Springer Science & Business Media. [30] Kappen, H. [2011], Optimal control theory and the linear bellman equation . [31] Kappen, H. J., Gómez, V. and Opper, M. [2012], Optimal control as a graphical model inference problem , Machine learning 87, 159 182. [32] Karras, T., Aittala, M., Aila, T. and Laine, S. [n.d.], Elucidating the design space of diffusion-based generative models, in Advances in Neural Information Processing Systems . [33] Kingma, D. P. and Ba, J. [2014], Adam: A method for stochastic optimization , ar Xiv preprint ar Xiv:1412.6980 . [34] Kreyszig, E. [1978], Introduction to functional analysis with application, book . [35] Lax, P. D. and Terrell, M. S. [2017], Multivariable calculus with applications, Springer. [36] Le Gall, J.-F. [2016], Brownian motion, martingales, and stochastic calculus, Springer. [37] Levine, S. [2018], Reinforcement learning and control as probabilistic inference: Tutorial and review , ar Xiv preprint ar Xiv:1805.00909 . [38] Li, X., Thickstun, J., Gulrajani, I., Liang, P. S. and Hashimoto, T. B. [2022], Diffusion-lm improves controllable text generation , Advances in Neural Information Processing Systems 35, 4328 4343. [39] Liang, Z., Mu, Y., Ding, M., Ni, F., Tomizuka, M. and Luo, P. [2023], Adaptdiffuser: Diffusion models as adaptive self-evolving planners , ar Xiv preprint ar Xiv:2302.01877 . [40] Magill, M., Qureshi, F. and de Haan, H. [2018], Neural networks trained to solve differential equations learn general representations , Advances in neural information processing systems 31. [41] Melas-Kyriazi, L., Rupprecht, C., Laina, I. and Vedaldi, A. [2023], Realfusion: 360 {\deg} reconstruction of any object from a single image , ar Xiv preprint ar Xiv:2302.10663 . [42] Misra, D. [n.d.], Mish: A self regularized non-monotonic activation function . [43] Ng, A. Y., Russell, S. et al. [2000], Algorithms for inverse reinforcement learning., in Icml , Vol. 1, p. 2. [44] Nichol, A. Q. and Dhariwal, P. 
[2021], Improved denoising diffusion probabilistic models, in International Conference on Machine Learning , PMLR, pp. 8162 8171. [45] Øksendal, B. and Øksendal, B. [2003], Stochastic differential equations, Springer. [46] Paszke, A., Gross, S., Massa, F., Lerer, A., Bradbury, J., Chanan, G., Killeen, T., Lin, Z., Gimelshein, N., Antiga, L., Desmaison, A., Kopf, A., Yang, E., De Vito, Z., Raison, M., Tejani, A., Chilamkurthy, S., Steiner, B., Fang, L., Bai, J. and Chintala, S. [2019], Pytorch: An imperative style, high-performance deep learning library, in Advances in Neural Information Processing Systems 32 , Curran Associates, Inc., pp. 8024 8035. URL: http://papers.neurips.cc/paper/9015-pytorch-an-imperative-style-high-performance-deep-learninglibrary.pdf [47] Pearce, T., Rashid, T., Kanervisto, A., Bignell, D., Sun, M., Georgescu, R., Macua, S. V., Tan, S. Z., Momennejad, I., Hofmann, K. et al. [2023], Imitating human behaviour with diffusion models , ar Xiv preprint ar Xiv:2301.10677 . [48] Poole, B., Jain, A., Barron, J. T. and Mildenhall, B. [2023], Dreamfusion: Text-to-3d using 2d diffusion, in The Eleventh International Conference on Learning Representations . URL: https://openreview.net/forum?id=Fj Nys5c7Vy Y [49] Radford, A., Kim, J. W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J. et al. [2021], Learning transferable visual models from natural language supervision, in International conference on machine learning , PMLR, pp. 8748 8763. [50] Raissi, M. [2018], Deep hidden physics models: Deep learning of nonlinear partial differential equations , The Journal of Machine Learning Research 19(1), 932 955. [51] Raissi, M., Perdikaris, P. and Karniadakis, G. E. [2019], Physics-informed neural networks: A deep learning framework for solving forward and inverse problems involving nonlinear partial differential equations , Journal of Computational physics 378, 686 707. [52] Ramesh, A., Dhariwal, P., Nichol, A., Chu, C. and Chen, M. [2022], Hierarchical text-conditional image generation with clip latents , ar Xiv preprint ar Xiv:2204.06125 . [53] Reuss, M., Li, M., Jia, X. and Lioutikov, R. [2023], Goal-conditioned imitation learning using score-based diffusion policies , ar Xiv preprint ar Xiv:2304.02532 . [54] Rombach, R., Blattmann, A., Lorenz, D., Esser, P. and Ommer, B. [2022], High-resolution image synthesis with latent diffusion models, in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pp. 10684 10695. [55] Russel, S. [n.d.], Human compatible: Artificial intelligence and the problem of control (great britain: Allen lane, 2019). ibo van de poel, embedding values in artificial intelligence (ai) systems. , Minds and Machines 30, 385 409. [56] Russell, S. [1998], Learning agents for uncertain environments, in Proceedings of the eleventh annual conference on Computational learning theory , pp. 101 103. [57] Russell, S. [2019], Human compatible: Artificial intelligence and the problem of control, Penguin. [58] Särkkä, S. and Solin, A. [2019], Applied stochastic differential equations, Vol. 10, Cambridge University Press. [59] Schramowski, P., Brack, M., Deiseroth, B. and Kersting, K. [2022], Safe latent diffusion: Mitigating inappropriate degeneration in diffusion models , ar Xiv preprint ar Xiv:2211.05105 . [60] Schulman, J., Wolski, F., Dhariwal, P., Radford, A. and Klimov, O. [2017], Proximal policy optimization algorithms , ar Xiv preprint ar Xiv:1707.06347 . 
[61] Sohl-Dickstein, J., Weiss, E., Maheswaranathan, N. and Ganguli, S. [2015], Deep unsupervised learning using nonequilibrium thermodynamics, in International Conference on Machine Learning , PMLR, pp. 2256 2265. [62] Song, J., Meng, C. and Ermon, S. [2021], Denoising diffusion implicit models, in International Conference on Learning Representations . URL: https://openreview.net/forum?id=St1giar CHLP [63] Song, Y. and Ermon, S. [2019], Generative modeling by estimating gradients of the data distribution, in Proceedings of the 33rd International Conference on Neural Information Processing Systems , pp. 11918 11930. [64] Song, Y. and Ermon, S. [2020], Improved techniques for training score-based generative models , Advances in neural information processing systems 33, 12438 12448. [65] Song, Y., Sohl-Dickstein, J., Kingma, D. P., Kumar, A., Ermon, S. and Poole, B. [n.d.], Score-based generative modeling through stochastic differential equations, in International Conference on Learning Representations . [66] Todorov, E. [2008], General duality between optimal control and estimation, in 2008 47th IEEE Conference on Decision and Control , IEEE, pp. 4286 4292. [67] von Platen, P., Patil, S., Lozhkov, A., Cuenca, P., Lambert, N., Rasul, K., Davaadorj, M. and Wolf, T. [2022], Diffusers: State-of-the-art diffusion models , https://github.com/huggingface/diffusers. [68] Wang, Z., Hunt, J. J. and Zhou, M. [2022], Diffusion policies as an expressive policy class for offline reinforcement learning , ar Xiv preprint ar Xiv:2208.06193 . [69] Zhong, Z., Rempe, D., Xu, D., Chen, Y., Veer, S., Che, T., Ray, B. and Pavone, M. [2022], Guided conditional diffusion for controllable traffic simulation , ar Xiv preprint ar Xiv:2210.17366 . [70] Ziebart, B. D. [2010], Modeling purposeful adaptive behavior with the principle of maximum causal entropy, Carnegie Mellon University. [71] Ziebart, B. D., Maas, A. L., Bagnell, J. A., Dey, A. K. et al. [2008], Maximum entropy inverse reinforcement learning., in Aaai , Vol. 8, Chicago, IL, USA, pp. 1433 1438. Appendix A More Details on Diffusion Models A.1 Diffusion Models in Continuous Time At a high level, diffusion models work by adding noise to data x, and then learning to denoise it. The seminal paper on diffusion models [61] and the more recent work of Ho et al. [21] describe this in discrete time. Song et al. [65] show a direct correspondence between the aforementioned formulations of denoising diffusion models and an existing line of work on score-based generative modeling [63], and propose a continuoustime framework leveraging (Itô) Stochastic Differential Equations [36, 45, 58] to unify both methods. We give an overview of the continuous-time formulation, and defer to Appendix A.2 an account of the discrete case. The forward noising process in continuous time is given by: dxt = f(xt, t)dt + g(t)dwt (15) where f is a function that is Lipschitz, w is a standard Brownian Motion [36] and g is a continuous noise schedule (c.f. A.2). A continuous version of the processes in A.2 [61, 21] is recovered by setting g(s) = βs and f(x, s) = 1 2βsx. The resulting SDE resembles an Ornstein-Uhlenbeck process, which is known to converge in geometric rate to a N(0, I) distribution. This justifies the choice of distribution of x T in A.2. Song et al. 
[65] then use a result of Anderson [4] to write the SDE satisfied by the reverse process of (2), denoted x̄_t, as:

dx̄_t = [f(x̄_t, t) − g(t)^2 ∇_x log p_t(x̄_t)] dt + g(t) dw̄_t    (16)

Here p_t(x) denotes the marginal density function of the forward process x_t, and w̄ is a reverse Brownian motion (for further details on the latter, see [4]). The learning step is then to learn an approximation s_Θ(x_t, t) to ∇_x log p_t(x_t) (the score of the distribution of x_t). This is shown by Song et al. [65] to be equivalent to the noise model ϵ_Θ in A.2. The sampling step consists of taking x̄_T ∼ N(0, I) for some T > 0, and simulating (16) backwards in time to arrive at x̄_0 ≈ x_0. One of the advantages of this perspective is that the sampling loop can be offloaded to standard SDE solvers, which can more flexibly trade off e.g. computation time for performance, compared to hand-designed discretizations [32].

A.2 Discrete-Time Diffusion Models

The seminal paper on diffusion models [61] and the more recent work of Ho et al. [21] formulate diffusion models in discrete time. The forward noising process is a Markov chain starting from an uncorrupted data point x_0, and repeatedly applying a (variance-preserving) Gaussian kernel to the data:

q_t(x_{t+1} | x_t) = N(√(1 − β_t) x_t, β_t I)

The coefficients β_t are known as the noise schedule of the diffusion model, and can be chosen to regulate the speed with which noise is added to the data [32]. The modeling step then aims to revert the noising process by learning a backward kernel

p_θ(x_{t−1} | x_t) = N(µ_θ(x_t, t), σ_t^2)    (17)

The seminal paper of Sohl-Dickstein et al. [61] proposes a variational loss for training µ_θ, based on a maximum-likelihood framework. Subsequently, Ho et al. [21] proposed a new parametrization of µ_θ, as well as a much simpler loss function. Their loss function is

L_simple(θ) := E_{t, x_0, ϵ} [ ||ϵ − ϵ_θ(√ᾱ_t x_0 + √(1 − ᾱ_t) ϵ, t)||^2 ]    (18)

where α_t := 1 − β_t, ᾱ_t := Π_{s=1}^{t} α_s, and µ_θ(x_t, t) = (1/√α_t) ( x_t − (β_t / √(1 − ᾱ_t)) ϵ_θ(x_t, t) ). Intuitively, this loss says that ϵ_θ(x_t, t) should fit the noise ϵ added to the initial datapoint x_0 up to time t. For a full account of this derivation, we refer the reader to Ho et al. [21].
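For concreteness, a minimal PyTorch sketch of the L_simple objective in Eq. (18) is given below; eps_model is an assumed noise-prediction network and betas an arbitrary discrete noise schedule, so this is illustrative rather than the exact training code of [21].

```python
import torch

def ddpm_simple_loss(eps_model, x0, betas):
    # L_simple (Eq. 18): predict the noise added to x_0 at a random timestep.
    T = betas.shape[0]
    alphas = 1.0 - betas
    alpha_bars = torch.cumprod(alphas, dim=0)                  # \bar{alpha}_t

    t = torch.randint(0, T, (x0.shape[0],), device=x0.device)  # uniform timestep
    a_bar = alpha_bars[t].view(-1, *([1] * (x0.dim() - 1)))
    eps = torch.randn_like(x0)

    # x_t = sqrt(abar_t) * x_0 + sqrt(1 - abar_t) * eps
    x_t = torch.sqrt(a_bar) * x0 + torch.sqrt(1.0 - a_bar) * eps
    return ((eps - eps_model(x_t, t)) ** 2).mean()
```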
A.2.1 Classifier Guidance

Sohl-Dickstein et al. [61] also propose a method for steering the sampling process of diffusion models. Intuitively, if the original diffusion model computes a distribution p(x_0), we would like to make it compute p(x_0 | y), where we assume, for the sake of concreteness, that y ∈ {0, 1}. Applying Bayes's rule gives the factorization p(x_0 | y) ∝ p(x_0) · p(y | x_0). The first term is the original distribution the diffusion model is trained on. The second term is a probabilistic classifier, inferring the probability that y = 1 for a given sample x_0.

Figure 5: Reinforcement Learning as Probabilistic Inference (see App. B for a detailed explanation). (a) Probabilistic Graphical Model (PGM) for sampling from a Markov Decision Process (MDP) with a fixed policy π. (b) PGM for planning, with optimality variables O_{1:T}.

We now leverage the framework in A.1 to derive the classifier guidance procedure. Equation (16) shows we can sample from the backward process by modeling ∇_x log p_t(x_t) using s_Θ(x_t, t). Thus, to model p(x_0 | y), we may instead approximate ∇_x log p(x_t | y) using:

∇_x log p(x_t | y) = ∇_x log [p(x_t) p(y | x_t)] = ∇_x log p(x_t) + ∇_x log p(y | x_t)    (20)
≈ s_Θ(x_t, t) + ∇_x ρ(x_t, t)    (21)

where ρ(x, t) is a neural network approximating log p(y | x_t). In practice, we multiply the gradients of ρ by a small constant ω, called the guidance scale. The guided reverse SDE hence becomes:

dx̄_t = [f(x̄_t, t) − g(t)^2 (s_Θ(x̄_t, t) + ω ∇_x ρ(x̄_t, t))] dt + g(t) dw̄_t    (22)

The main takeaway is, informally, that we can steer a diffusion model to produce samples with some property y by gradually pushing the samples in the direction that maximizes the output of a classifier predicting p(y | x).

Appendix B  More Details on Reinforcement Learning as Probabilistic Inference

Levine [37] provides an in-depth exposition on existing methods for approaching sequential decision-making from a probabilistic and causal angle, referencing the work of e.g. [66, 70, 71, 31, 30]. We refer to it as the Control as Inference Framework (CIF). The starting point for the formulation is the classical notion of a Markov Decision Process (MDP), a stochastic process given by a state space S, an action space A, a (possibly stochastic) transition function T : S × A → ∆(S), and a reward function r : S × A → R_{≤0}. Here ∆(X) denotes the set of distributions over a set X. In this context, a policy π : S → ∆(A) represents a (possibly stochastic) way of choosing an action given the current state. The MDP starts at an initial state s_0 ∈ S, and evolves by sampling a_t ∼ π(s_t), and then s_{t+1} ∼ T(s_t, a_t), for t ≥ 0. The reward received at time t is r_t = r(s_t, a_t). We consider here an episodic setting, where the MDP stops at a fixed time T > 0. We call the sequence ((s_t, a_t))_{t=0}^{T} a trajectory.

Sampling corresponds to the Probabilistic Graphical Model (PGM) in Fig. 5a, where each action depends on the current state, and the next state depends on the current state and the current action. To encode the process of choosing a policy to optimize a reward, this PGM can be extended with optimality variables, as per Fig. 5b. They are defined as O_t ∼ Ber(e^{r(s_t, a_t)}), so that their distribution encodes the dependency of reward on current states and actions. The problem of optimizing the reward can then be recast as sampling from p(τ | O_{1:T}). Following [37], one can apply Bayes's rule to obtain

p(τ | O_{1:T}) ∝ p(τ) · p(O_{1:T} | τ)    (23)

which factorizes the distribution of optimal trajectories (up to a normalizing constant) into a prior p(τ) over trajectories and a likelihood term p(O_{1:T} | τ). Observe that, from the definition of O_{1:T} and the PGM factorization in Fig. 5b, we have p(O_{1:T} | τ) = e^{Σ_t r(s_t, a_t)}. Hence, in the CIF, the log-likelihood log p(O_{1:T} | τ) corresponds to the cumulative reward of a trajectory τ.

Appendix C  Details on Main Derivations

C.1 Existence and Uniqueness of Strong Solutions to Stochastic Differential Equations

Theorem C.1 (Existence and uniqueness theorem for SDEs, cf. page 66 of Øksendal and Øksendal [45]). Let T > 0 and let b(·, ·) : [0, T] × R^n → R^n, σ(·, ·) : [0, T] × R^n → R^{n×m} be measurable functions satisfying

|b(t, x)| + |σ(t, x)| ≤ C(1 + |x|),  x ∈ R^n, t ∈ [0, T],

for some constant C, where |σ|^2 = Σ |σ_ij|^2, and such that

|b(t, x) − b(t, y)| + |σ(t, x) − σ(t, y)| ≤ D|x − y|,  x, y ∈ R^n, t ∈ [0, T],

for some constant D. Let Z be a random variable which is independent of the σ-algebra F_∞^(m) generated by B_s(·), s ≥ 0, and such that E[|Z|^2] < ∞. Then the stochastic differential equation

dX_t = b(t, X_t) dt + σ(t, X_t) dB_t,  0 ≤ t ≤ T,  X_0 = Z

has a unique t-continuous solution X_t(ω) with the property that X_t(ω) is adapted to the filtration F_t^Z generated by Z and B_s(·), s ≤ t, and E[∫_0^T |X_t|^2 dt] < ∞.

Remark C.2. In the above, the symbol |·| is overloaded, and taken to mean the norm of its argument.
As everything in sight is finite-dimensional, and as all norms on finite-dimensional normed vector spaces are topologically equivalent (Theorem 2.4-5 in Kreyszig [34]), the stated growth conditions do not depend on the particular norm chosen for R^n. Also, whenever we talk about a solution to an SDE with respect to e.g. the standard Brownian motion w and an initial condition z, we are always referring only to (F_t^z)_{t≥0}-adapted stochastic processes, as per Theorem C.1.

Corollary C.3. Fix a Brownian motion w and let (F_t)_{t≥0} be its natural filtration. Consider an SDE d x_t = f(x_t, t) dt + g(t) d w_t with initial condition z ∈ L² independent of F_∞ = σ(∪_{t≥0} F_t). Suppose f : R^n × [0, T] → R^n is Lipschitz (as a map from R^n × [0, T] to R^n) and g : [0, T] → R≥0 is continuous and bounded. Then the conclusion of Theorem C.1 holds, and there is an a.s. unique (F_t^z)_{t≥0}-adapted solution (x_t)_{t≥0} having a.s. continuous paths, where (F_t^z)_{t≥0} is defined as in Theorem C.1.

Proof. Firstly, note that, as f and g are continuous, they are Lebesgue-measurable. It remains to check the growth conditions in Theorem C.1. As f is Lipschitz, there exists C_1 (w.l.o.g. C_1 > 1) such that, for any x, y ∈ R^n and t, s ∈ [0, T]:

|f(x, t) − f(y, s)| ≤ C_1 (|x − y| + |t − s|)    (24)
                   ≤ C_1 (|x − y| + T)    (25)

In particular, taking y = 0 and s = 0, we obtain that |f(x, t) − f(0, 0)| ≤ C_1 (|x| + T). Let M = |f(0, 0)| / C_1. Then, by the triangle inequality, |f(x, t)| ≤ C_1 (|x| + T + M). As g is bounded, there exists C_2 > 0 such that |g(t)| ≤ C_2, and so for any t, s ∈ [0, T] we have:

|g(t) − g(s)| ≤ 2 C_2    (26)

Hence we have, for t ∈ [0, T] and x, y ∈ R^n:

|f(x, t)| + |g(t)| ≤ ((T + M) C_1 + C_2)(1 + |x|),    |f(x, t) − f(y, t)| ≤ C_1 |x − y|    (27)

which yields the growth conditions required in Theorem C.1, and the conclusion follows.

C.2 Proof of Theorem 4.1

Proof of Theorem 4.1. We reference the fundamental result on the existence and uniqueness of strong solutions to SDEs, presented in Section 5.2.1 of [45] and reproduced above. In particular, Corollary C.3 shows that the Existence and Uniqueness Theorem C.1 applies to SDEs 6 and 7, by our assumptions on f^(1), f^(2) and g. As we also restrict our attention to h that are Lipschitz, and sums of Lipschitz functions are Lipschitz, the corollary also applies to SDE 8. This establishes the existence of a.s. unique (adapted) solutions with a.s. continuous sample paths for these SDEs. Call them x^(1), x^(2) and x̄, respectively.

Denote h_1(x, t) = f^(2)(x, t) − f^(1)(x, t). The choice h = h_1 makes SDE 8 have the same drift and noise coefficients as SDE 7, and so, by the uniqueness of solutions we established, it follows that x^(2)_t = x̄_t for all t a.s., with this choice of h.

Now suppose we have a Lipschitz-continuous h which yields an a.s. unique, t-continuous solution (x̄_t)_{t≥0} that is indistinguishable from x^(2) (i.e. equal to x^(2) for all t, a.s.). We show that h = h_1. As x̄ and x^(2) are indistinguishable and satisfy SDEs with the same noise coefficient, we obtain a.s., for any t ∈ [0, T]:

0 = x̄_t − x^(2)_t    (28)
  = (x̄_0 + ∫_0^t [f^(1)(x̄_s, s) + h(x̄_s, s)] ds + ∫_0^t g(s) d w_s) − (x^(2)_0 + ∫_0^t f^(2)(x^(2)_s, s) ds + ∫_0^t g(s) d w_s)    (29)
  = ∫_0^t [f^(1)(x̄_s, s) + h(x̄_s, s)] ds − ∫_0^t f^(2)(x^(2)_s, s) ds    (30)
  = ∫_0^t [f^(1)(x^(2)_s, s) + h(x^(2)_s, s)] ds − ∫_0^t f^(2)(x^(2)_s, s) ds    (31)
  = ∫_0^t [h(x^(2)_s, s) − (f^(2)(x^(2)_s, s) − f^(1)(x^(2)_s, s))] ds    (32)
  = ∫_0^t [h(x^(2)_s, s) − h_1(x^(2)_s, s)] ds    (33)

where in 29 we substitute the integral forms of the SDEs satisfied by x̄ and x^(2), in 30 the stochastic integrals and initial values cancel because the two SDEs share the same noise coefficient and the processes agree at t = 0, and in 31 we use that the processes are indistinguishable to replace x̄ by x^(2) in one of the integrals. Hence, with probability 1, ∫_0^t h(x^(2)_s, s) ds = ∫_0^t h_1(x^(2)_s, s) ds. As x^(2) is a.s.
continuous, we also have that the mappings s ↦ h_1(x^(2)_s, s) and s ↦ h(x^(2)_s, s) are continuous with probability 1. We may hence apply the Fundamental Theorem of Calculus and differentiate 33 to conclude that, a.s., h(x^(2)_s, s) = h_1(x^(2)_s, s) for all s. As g(0) > 0, x^(2)_t is supported on all of R^n for any t > 0. Therefore, the above implies that h(x, s) = h_1(x, s) for all s ∈ (0, T] and all x ∈ R^n.

More formally, let Δ(x′, s) = h(x′, s) − h_1(x′, s). Then Δ(x^(2)_s, s) = 0 for all s, a.s. Also, Δ is Lipschitz; let C be its Lipschitz constant. Take x ∈ R^n and fix ε > 0 and t > 0. Then P(‖x^(2)_t − x‖_2 < ε/(2C)) > 0, as x^(2)_t is supported on all of R^n. Hence, by the above, there exists y_ε ∈ B(x, ε/(2C)) such that Δ(y_ε, t) = 0. As Δ is Lipschitz, for any x′ ∈ B(x, ε/(2C)), we have that

‖Δ(x′, t)‖ = ‖Δ(x′, t) − Δ(y_ε, t)‖ ≤ C ‖x′ − y_ε‖    (34)
           ≤ C (‖x′ − x‖ + ‖x − y_ε‖)    (35)
           ≤ C ‖x′ − x‖ + C · ε/(2C)    (36)
           = C ‖x′ − x‖ + ε/2    (37)

As ε was arbitrary, we have that Δ(x′, t) → 0 as x′ → x. As Δ is continuous (since it is Lipschitz), it follows that Δ(x, t) = 0. As x was also arbitrary, we have that h(x, s) = h_1(x, s) for all s ∈ (0, T] and all x ∈ R^n. As h must be chosen to be (Lipschitz) continuous also with respect to t, for any x it must be that

h(x, 0) = lim_{s→0⁺} h(x, s) = lim_{s→0⁺} h_1(x, s) = h_1(x, 0)    (38)

This completes the proof of the uniqueness of the choice h = h_1.

C.3 Does the Backward Process of a Diffusion Model Fit in Theorem C.1?

In the forward process (Equation 2), f is taken to be Lipschitz, and g is always bounded, e.g. in [21, 62, 65]. However, in the backward process, one may be concerned about whether the score term ∇_x log p_t(x_t) is Lipschitz. We argue that it is reasonable to assume that it is. We consider a setting where g(t) = 1 and f(x, t) = −λx. Then the forward process is the well-understood Ornstein-Uhlenbeck (or Langevin) process [45, p. 74]:

d x_t = −λ x_t dt + d w_t    (39)

Consider the simplified case where the initial condition is deterministic: x_0 = y ∈ R^n. Then x is a Gaussian process with covariance function K(s, t) = cov(x_s, x_t) = ((e^{−λ|t−s|} − e^{−λ(t+s)}) / (2λ)) I, where I is the identity matrix, as per Le Gall [36], page 226. Its mean function is m_t = y e^{−λt}. In particular, it has variance σ_t² = (1 − e^{−2λt}) / (2λ), and σ_t² → 1/(2λ) as t → ∞.

We follow the theoretical analysis of the convergence of diffusion models by De Bortoli [9] and assume the backward diffusion process finishes at a time ε > 0. Hence, the backward process x̄_t is supported on all of R^n for any t ∈ [ε, T]. The original data need not be supported everywhere, and in fact is often assumed not to be, c.f. the manifold hypothesis in [9]. In our simplified case, it is supported at a point. This is why t = 0 needs to be excluded by the backward process finishing at some ε > 0.

From the aforementioned properties of the Ornstein-Uhlenbeck process,

p_t(x_t) = C exp(−‖x_t − m_t‖²_2 / (2σ_t²)),

where C is a constant independent of x. But then ∇_x log p_t(x_t) = −(x_t − m_t) / σ_t². For any fixed t, we hence have that ∇_x log p_t(x_t) is (1/σ_t²)-Lipschitz in x. Since the variance σ_t² is bounded away from 0 for t ∈ [ε, T], we may pick a Lipschitz constant that works uniformly across all x and t (for instance, 1/σ_ε²). This can be extended to the case where the initial data is bounded, which is frequently assumed (e.g. by De Bortoli [9]). Informally, if the data is bounded, it cannot change the tails of the distribution of x_t significantly. We omit further details, as this discussion is beyond the scope of the paper. In summary, the forward noising process is very similar to an Ornstein-Uhlenbeck process, so, for practical purposes, it is reasonable to assume the scores ∇_x log p_t(x_t) are Lipschitz.
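As a sanity check of the closed-form score used above, the following short NumPy sketch (purely illustrative, with a scalar λ; not part of the paper's code) computes the Ornstein-Uhlenbeck mean, variance, and score, and verifies the 1/σ_t² Lipschitz bound numerically.

    import numpy as np

    def ou_score(x, t, y, lam=1.0):
        # Closed-form score of the OU process dx = -lam * x dt + dw started at x_0 = y:
        # grad_x log p_t(x) = -(x - m_t) / sigma_t^2, with
        # m_t = y * exp(-lam * t) and sigma_t^2 = (1 - exp(-2 * lam * t)) / (2 * lam).
        m_t = y * np.exp(-lam * t)
        var_t = (1.0 - np.exp(-2.0 * lam * t)) / (2.0 * lam)
        return -(x - m_t) / var_t

    # Numerical check of the Lipschitz constant 1 / sigma_t^2 at a fixed t >= eps > 0.
    rng = np.random.default_rng(0)
    y = rng.normal(size=3)
    t, lam = 0.5, 1.0
    var_t = (1.0 - np.exp(-2.0 * lam * t)) / (2.0 * lam)
    x1, x2 = rng.normal(size=3), rng.normal(size=3)
    lhs = np.linalg.norm(ou_score(x1, t, y, lam) - ou_score(x2, t, y, lam))
    rhs = (1.0 / var_t) * np.linalg.norm(x1 - x2)
    assert lhs <= rhs + 1e-9  # the score is (1 / sigma_t^2)-Lipschitz in x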
C.4 Existence and Uniqueness of the Minimizer of (11)

Definition C.4. L²(R^n, R^n) is the Hilbert space given by

L²(R^n, R^n) = { f : R^n → R^n : ∫_{R^n} ‖f(x)‖²_2 dx < ∞ }    (40)

Remark C.5. It is easy to show that L²(R^n, R^n) is a Hilbert space with norm given by

‖f‖²_2 = ∫_{R^n} ‖f(x)‖²_2 dx    (41)

For instance, one can start from the standard result that L²(R^n, R) is a Hilbert space, and apply it to each coordinate of the output of f.

Definition C.6. Denote by Cons(R^n) the space of gradients of smooth, square-integrable potentials from R^n to R with square-integrable gradient, i.e.

Cons(R^n) = { ∇f : f is smooth, ∫_{R^n} |f(x)|² dx < ∞ and ∫_{R^n} ‖∇f(x)‖²_2 dx < ∞ }    (42)

Remark C.7. Clearly Cons(R^n) ⊆ L²(R^n, R^n). The condition that f is square-integrable and has a square-integrable gradient corresponds to f being in the Sobolev space W^{1,2}(R^n) (see Jost [29], Chapter 9). Denote by cl(Cons(R^n)) the closure of Cons(R^n) in L²(R^n, R^n), i.e.

cl(Cons(R^n)) = { f : there exists (∇f_k)_{k≥0} ⊆ Cons(R^n) such that ‖∇f_k − f‖²_2 → 0 as k → ∞ }    (43)

By construction, cl(Cons(R^n)) is a vector subspace of L²(R^n, R^n), and is closed, i.e. stable under taking limits.

Theorem C.8 (Complementation in Hilbert spaces, c.f. Kreyszig [34], Theorem 3.3-4). Let (H, ⟨·,·⟩, ‖·‖) be a Hilbert space, and let Y ⊆ H be a closed vector subspace. Then any x ∈ H can be written as x = y + z, where y ∈ Y and z ∈ Y⊥.

Corollary C.9 (Uniqueness of projection). With H, Y, x, y and z as in Theorem C.8, the minimum of v ↦ ‖v − x‖ over v ∈ Y is attained at the (unique) y given in the theorem, as ‖v − x‖² ≥ |⟨v − x, z⟩| = ‖z‖², and setting v = y attains this bound.

Proof of Proposition 4.3. Firstly, note that we assumed s^(2)_Θ(·, t) − s^(1)_ϕ(·, t) is in L²(R^n, R^n) for each individual t. It follows directly from the above that, as cl(Cons(R^n)) is a closed subspace of L²(R^n, R^n), Corollary C.9 applies, and there is a unique minimizer h_t ∈ cl(Cons(R^n)) in Equation 11. As cl(Cons(R^n)) consists of L²(R^n, R^n)-limits of sequences in Cons(R^n), there exists a sequence (∇Φ_k)_{k≥0} of gradients of W^{1,2}-potentials such that ‖∇Φ_k − h_t‖²_2 → 0 as k → ∞. From the definition of convergence, we get that, for any ε > 0, there is k large enough such that

∫_{R^n} ‖∇_x Φ_k(x) − h_t(x)‖²_2 dx < ε    (44)

which completes the proof.

Appendix D Relative Reward Learning Algorithms

The algorithm below was used for learning the relative reward functions in the experiments with Stable Diffusion (see Section 5.3).

Algorithm 2: Relative reward function training with access only to diffusion models
Input: Base diffusion model s^(1) and expert diffusion model s^(2).
Output: Relative reward estimator ρ_θ.
// Dataset pre-generation using the diffusion models
D_1 ← ∅, D_2 ← ∅
for m ∈ {1, 2}, repeated K times do
    x_T ∼ N(0, I)
    for t = T − 1, ..., 0 do
        x^(1)_t ← x_{t+1} denoised by 1 more step using s^(1)
        x^(2)_t ← x_{t+1} denoised by 1 more step using s^(2)
        x_t ← x^(m)_t
        Add (t + 1, x_{t+1}, x^(1)_t, x^(2)_t) to D_m
    end
end
// Training
D ← D_1 ∪ D_2
Initialize parameters θ
for i = 1, ..., n_train_steps do
    // Using batch size of 1 for clarity
    Sample (t + 1, x_{t+1}, x^(1)_t, x^(2)_t) from D
    // Use the pre-computed diffusion outputs to compute the loss
    L̂_RRF(θ) ← ‖∇_x ρ_θ(x_{t+1}, t + 1) − (x^(2)_t − x^(1)_t)‖²_2
    Take an Adam [33] optimization step on θ according to L̂_RRF(θ)
end

Remark D.1. In our experiments with Stable Diffusion, the generation of the datasets D_1 and D_2 uses a prompt dataset for generating the images. For each prompt, we generate one image following the base model (m = 1), and one following the expert model (m = 2). The I2P prompt dataset [59] contains 4703 prompts.
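For concreteness, the following PyTorch-style sketch implements one training step of Algorithm 2 on a pre-generated batch; reward_net, the optimizer, and the tensor arguments are hypothetical placeholders, so this is a minimal illustration under those assumptions rather than the released implementation.

    import torch

    def rrf_training_step(reward_net, optimizer, x_next, t_next, x1_t, x2_t):
        # x_next: noisy sample x_{t+1};  t_next: its diffusion timestep t+1
        # x1_t / x2_t: one-step denoisings of x_next under the base / expert model
        x_next = x_next.detach().requires_grad_(True)
        # Scalar relative reward summed over the batch, so autograd yields per-sample input gradients
        reward = reward_net(x_next, t_next).sum()
        grad_reward = torch.autograd.grad(reward, x_next, create_graph=True)[0]
        # Align grad_x rho_theta(x_{t+1}, t+1) with the difference of the two denoising directions
        target = (x2_t - x1_t).detach()
        loss = ((grad_reward - target) ** 2).flatten(1).sum(dim=1).mean()
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        return loss.item()

In practice, one would repeatedly sample minibatches of tuples (t + 1, x_{t+1}, x^(1)_t, x^(2)_t) from the pre-generated dataset D and call this step with an Adam optimizer, mirroring the pseudocode above.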
Appendix E Implementation Details, Additional Results and Ablations

Table 2: Reward achieved by the agent when trained either with the ground-truth reward or with the extracted relative reward (Ours). Note that a policy taking random actions achieves close to zero reward.

Environment | Ground-truth Reward | Relative Reward (Ours)
Open Maze   | 92.89 ± 11.79       | 76.45 ± 19.10
UMaze       | 94.94 ± 8.89        | 74.52 ± 18.32
Medium Maze | 423.21 ± 51.30      | 276.10 ± 65.21
Large Maze  | 388.76 ± 121.39     | 267.56 ± 98.45

Table 3: Performance bounds for the ablation of t_stopgrad and guidance scales.

Environment | Lower Bound  | Mean  | Upper Bound
HalfCheetah | 30.24 ± 0.39 | 30.62 | 31.5 ± 0.35
Hopper      | 19.65 ± 0.27 | 22.32 | 25.03 ± 0.65
Walker2d    | 31.65 ± 0.92 | 34.35 | 38.14 ± 1.08

E.1 Implementation Details

E.1.1 Diffusion Models. Following Section 3 of [25], we use the DDPM framework of Ho et al. [21] with a cosine beta schedule. We do not use preconditioning in the sense of Section 5 of [32]. However, we do clip the denoised samples during sampling, and apply scaling to trajectories, as discussed below in Appendix E.1.6.

E.1.2 Relative Reward Function Architectures. The architectures and dimensions for all experiments are given in Table 4. In the Locomotion environments and in Stable Diffusion, we parameterize the relative reward functions using the encoder of a UNet, following the architecture that Janner et al. [25] use to parameterize their value function networks. In the Maze environments, we use an MLP taking in a sequence of h state-action pairs. In the following, the dimensions refer to the number of channels after each downsampling step in the case of UNet architectures, and to layer dimensions in the case of MLP architectures. The horizon refers to the number of state-action pairs input to the reward function network. When it does not match the diffusion horizon H, we split trajectories into H/h segments, feed each into the reward network, and average the results.

E.1.3 Relative Reward Function Training Hyperparameters. In Table 5 we indicate the learning rate and batch size used for training the reward functions, as well as the number of denoising timesteps of the diffusion models they are trained against. We report the number of training steps for the models used to generate the plots and numerical results in the main paper. We use Adam [46] as an optimizer, without weight decay. The diffusion models use ancestral sampling as in [25, 21], with a cosine noise schedule [44].

E.1.4 Conditional Generation Hyperparameters. In Locomotion environments, we find that the performance of the steered model is affected by the guidance scale ω and by t_stopgrad, a hyperparameter specifying the step after which we stop adding reward gradients to the diffusion model sample. For example, if t_stopgrad = 5, we only steer the generation up to timestep 5; for steps 4 down to 0, no further steering is used. In Tables 6 and 7 we indicate the respective values used for the results reported in Table 1 (see Discriminator and Reward (Ours)), as well as for the results of the ablation study presented later in Appendix E.3.2 (see Reward (Ours, Ablation)).

E.1.5 Using Estimates of E[x_{t−1} | x_0, x_t] for Gradient Alignment. Diffusion models can be trained to output score estimates s_θ(x_t, t) ≈ ∇_x log p_t(x_t), but also estimates of a denoised sample: μ_θ(x_t, t) = (1/√α_t)(x_t − (β_t / √(1 − ᾱ_t)) ε_θ(x_t, t)) ≈ E[x_{t−1} | x_0, x_t], where the noise model and the score model are related by s_θ(x_t, t) = −ε_θ(x_t, t)/√(1 − ᾱ_t). For two diffusion models s^(1)_ϕ and s^(2)_Θ, we have that μ_ϕ(x_t, t) − μ_Θ(x_t, t) ∝ −(s^(2)_Θ(x_t, t) − s^(1)_ϕ(x_t, t)) (notice that the sign gets flipped). However, we find that aligning the reward function gradients to the difference in means μ_ϕ(x_t, t) − μ_Θ(x_t, t) (instead of the difference in scores) leads to slightly better performance. Hence, in practice, we use the following training objective:

L_RRF(θ) = E_{t ∼ U[0, T], x_t ∼ p_t} [ ‖∇_x ρ_θ(x_t, t) − (μ_ϕ(x_t, t) − μ_Θ(x_t, t))‖²_2 ]    (45)

Notice that, since during sampling we multiply the outputs of ρ by a guidance scale ω > 0 of our choice, this objective is equivalent to the one in the main paper, up to a scale factor.
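To make the conversion above concrete, here is a small NumPy sketch (function names and arguments are illustrative placeholders, not the paper's code) that maps a noise prediction ε_θ(x_t, t) to the posterior-mean estimate μ_θ(x_t, t) from Equation 18 and forms the mean-difference target of Equation 45.

    import numpy as np

    def posterior_mean(eps_out, x_t, alpha_t, alpha_bar_t, beta_t):
        # mu_theta(x_t, t) = (x_t - beta_t / sqrt(1 - alpha_bar_t) * eps_theta(x_t, t)) / sqrt(alpha_t),
        # an estimate of E[x_{t-1} | x_0, x_t]
        return (x_t - beta_t / np.sqrt(1.0 - alpha_bar_t) * eps_out) / np.sqrt(alpha_t)

    def mean_difference_target(eps_base, eps_expert, x_t, alpha_t, alpha_bar_t, beta_t):
        # Target of Eq. (45): mu_phi(x_t, t) - mu_Theta(x_t, t), base (phi) minus expert (Theta)
        mu_base = posterior_mean(eps_base, x_t, alpha_t, alpha_bar_t, beta_t)
        mu_expert = posterior_mean(eps_expert, x_t, alpha_t, alpha_bar_t, beta_t)
        return mu_base - mu_expert

Because μ_θ is affine in the noise prediction, this target is a positive multiple of ε^(2)_Θ − ε^(1)_ϕ and hence a negative multiple of the score difference s^(2)_Θ − s^(1)_ϕ, matching the sign flip noted above.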
Table 4: Model architectures and dimensions of the relative reward functions.

Environment      | Architecture | Dimensions         | Diffusion Horizon | Reward Horizon
maze2d-open-v0   | MLP          | (32,)              | 256               | 64
maze2d-umaze-v0  | MLP          | (128, 64, 32)      | 256               | 4
maze2d-medium-v1 | MLP          | (128, 64, 32)      | 256               | 4
maze2d-large-v1  | MLP          | (128, 64, 32)      | 256               | 4
HalfCheetah      | UNet Encoder | (64, 64, 128, 128) | 4                 | 4
Hopper           | UNet Encoder | (64, 64, 128, 128) | 32                | 32
Walker2d         | UNet Encoder | (64, 64, 128, 128) | 32                | 32
Stable Diffusion | UNet Encoder | (32, 64, 128, 256) | -                 | -

Table 5: Parameters for the training of the relative reward functions.

Environment      | Learning Rate | Batch Size | Diffusion Timesteps | Training Steps
Maze             | 5 × 10⁻⁵      | 256        | 100                 | 100000
Locomotion       | 2 × 10⁻⁴      | 128        | 20                  | 50000
Stable Diffusion | 1 × 10⁻⁴      | 64         | 50                  | 6000

Table 6: Guidance scale ω used for sampling (rollouts) in Locomotion.

Environment | Discriminator | Reward (Ours, Ablation) | Reward (Ours)
HalfCheetah | ω = 0.5       | ω = 0.1                 | ω = 0.2
Hopper      | ω = 0.4       | ω = 0.1                 | ω = 0.4
Walker2d    | ω = 0.2       | ω = 0.3                 | ω = 0.3

E.1.6 Trajectory Normalization. Following the implementation of Janner et al. [25], we use min-max normalization for trajectories in Maze and Locomotion environments. The minima and maxima are computed for each coordinate of states and actions, across all transitions in the dataset. Importantly, they are computed without any noise added to the transitions. For our experiments with Stable Diffusion, we use standard scaling (i.e. centering the data at 0 and scaling it to unit variance on each coordinate). Since in this case a dataset is pre-generated containing all intermediate diffusion timesteps, the means and standard deviations are computed separately for each diffusion timestep, and for each coordinate of the CLIP latents.

E.1.7 Reward Function Horizon vs. Diffuser Horizon. We consider different ways of picking the number of timesteps fed into our network for reward prediction. Note that the horizon of the reward model network must be compatible with the length of the Diffuser trajectories (256). To account for this, we choose horizons h that are powers of two and split a given Diffuser trajectory of length 256 into 256/h sub-trajectories, for which we compute the reward separately. We then average the results over all windows (instead of summing them, so that the final numerical value does not scale inversely with the chosen horizon); see the sketch below. The reward horizon h used for different experiments is indicated in Table 4.

E.1.8 Visualization Details. To obtain visualizations of the reward function that are intuitively interpretable, we modified the D4RL [12] dataset in Maze2D to use start and goal positions uniformly distributed in space, as opposed to only being supported at integer coordinates. We found that this modification had no influence on performance; it was made only to obtain visually interpretable reward functions (as shown in Figure 3).
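Returning to the reward-horizon windowing of Appendix E.1.7, the following Python sketch (assuming a generic reward_net callable and a NumPy trajectory array; illustrative only) splits a length-256 trajectory into 256/h windows and averages the per-window rewards.

    import numpy as np

    def windowed_reward(reward_net, trajectory, h):
        # Split a (256, d) trajectory of state-action pairs into 256 // h windows of length h,
        # score each window with reward_net, and average (rather than sum) the window rewards.
        length = trajectory.shape[0]
        assert length % h == 0, "horizon h must divide the trajectory length"
        windows = trajectory.reshape(length // h, h, -1)
        rewards = [reward_net(w) for w in windows]
        return float(np.mean(rewards))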
E.1.9 Quantitative Evaluation in Maze2D. We quantitatively evaluated the learned reward functions in Maze2D by comparing the position of the global maximum of the learned reward function to the true goal position. We first smoothed the reward functions slightly to remove artifacts and verified whether the position of the global reward maximum was within 1 grid cell of the correct goal position. We allow for this margin because, in the trajectory generation procedure, the goal position is not given as an exact position but by a circle with a diameter of 50% of a cell's side length. We train reward functions for 8 distinct goal positions in each of the four maze configurations, and for 5 seeds each. The confidence intervals are computed as p̂ ± 1.96 √(p̂(1 − p̂)/n), where p̂ is the fraction of correct goal predictions and n is the total number of predictions (e.g. for the total accuracy estimate, n = 4 mazes × 8 goals × 5 seeds = 160).

E.1.10 Running Time. The Diffuser models [25] take approximately 24 hours to train for 1 million steps on an NVIDIA Tesla P40 GPU. Reward functions took around 2 hours to train for 100000 steps in Maze environments, and around 4.5 hours for 50000 steps in Locomotion environments, also on an NVIDIA Tesla P40 GPU. For the Stable Diffusion experiment, it took around 50 minutes to run 6000 training steps on an NVIDIA Tesla M40 GPU. For Locomotion environment evaluations, we ran 512 independent rollouts in parallel for each steered model. Running all 512 of them took around 1 hour, also on an NVIDIA Tesla P40.

Table 7: t_stopgrad values used for sampling (rollouts) in Locomotion.

Environment | Discriminator   | Reward (Ours, Ablation) | Reward (Ours)
HalfCheetah | t_stopgrad = 8  | t_stopgrad = 0          | t_stopgrad = 1
Hopper      | t_stopgrad = 10 | t_stopgrad = 2          | t_stopgrad = 1
Walker2d    | t_stopgrad = 1  | t_stopgrad = 5          | t_stopgrad = 2

Table 8: Results for the ablation on generalization in Locomotion.

Environment | Unsteered    | Reward (Ours, Ablation)
HalfCheetah | 31.74 ± 0.37 | 33.06 ± 0.31
Hopper      | 22.95 ± 0.81 | 25.03 ± 0.79
Walker2d    | 30.44 ± 1.01 | 42.40 ± 1.07
Mean        | 28.38        | 33.50

E.2 Additional Results

E.2.1 Retraining of Policies in Maze2D with the Extracted Reward Function. We conducted additional experiments showing that, using our extracted reward function, agents can be trained from scratch in Maze2D. We observe in Table 2 that agents trained this way with PPO [60] achieve 73.72% of the reward obtained by agents trained with the ground-truth reward function, underscoring the robustness of our extracted function; note that a randomly acting policy achieves close to zero reward.

E.3 Ablations

E.3.1 Ablation Study on Dataset Size in Maze2D. The main experiments in Maze2D were conducted with datasets of 10 million transitions. To evaluate the sensitivity of our method to dataset size, we conducted a small ablation study of 24 configurations with datasets containing only 10 thousand transitions, hence on the order of tens of trajectories. The accuracy in this ablation was 75.0% (compared to 78.12% in the main experiments). Examples of visual results for the learned reward functions are shown in Figure 6. This ablation is highly relevant, as it indicates that our method can achieve good performance even from little data.

Figure 6: Heatmaps for rewards learned with 10 thousand and 1 million transition datasets. While there is slight performance degradation, the maxima of the reward still largely correspond to the ground-truth goals.

E.3.2 Ablation Study on Generalization in Locomotion. For the main steering experiments in Locomotion, the reward functions are trained on the same base diffusion model that is then steered. We conducted an ablation study to investigate whether the learned reward
function generalizes to other base models, i.e. yields significant performance increases when used to steer a base model that was not part of the training process. We trained additional base models with new seeds and steered these base diffusion models with the previously learned reward function. We report results for this ablation in Table 8. We found that the relative increase in performance is similar to that reported in the main results in Table 1, and therefore conclude that our learned reward function generalizes to new base diffusion models.

E.3.3 Ablation Study on Hyperparameters in Locomotion. We conducted additional ablations with respect to t_stopgrad and the guidance scale in the Locomotion environments, for a decreased and an increased value each (hence four additional experiments per environment). We observe in Table 3 that results are stable across these different parameters.

Appendix F Additional Results for Maze2D