# valuebased_deep_rl_scales_predictably__3abdd2a0.pdf

Value-Based Deep RL Scales Predictably

Oleh Rybkin 1 Michal Nauman 1 2 Preston Fu 1 Charlie Snell 1 Pieter Abbeel 1 Sergey Levine 1 Aviral Kumar 3

Abstract: Scaling data and compute is critical to the success of modern ML. However, scaling demands predictability: we want methods to not only perform well with more compute or data, but also have their performance be predictable from small-scale runs, without running the large-scale experiment. In this paper, we show that value-based off-policy RL methods are predictable despite community lore regarding their pathological behavior. First, we show that data and compute requirements to attain a given performance level lie on a Pareto frontier, controlled by the updates-to-data (UTD) ratio. By estimating this frontier, we can predict the data requirement when given more compute, and the compute requirement when given more data. Second, we determine the optimal allocation of a total resource budget across data and compute for a given performance and use it to determine hyperparameters that maximize performance for a given budget. Third, this scaling is enabled by first estimating predictable relationships between hyperparameters, which are used to manage effects of overfitting and plasticity loss unique to RL. We validate our approach using three algorithms, SAC, BRO, and PQL, on DeepMind Control, OpenAI Gym, and Isaac Gym, when extrapolating to higher levels of data, compute, budget, or performance.

1. Introduction

Many recent advances in various areas of machine learning have emerged from training big models on large datasets. In this scaling-guided research landscape, successfully executing even a single training run often requires a large amount of data, computational resources, and wall-clock time, such as weeks or months (Achiam et al., 2023; Team et al., 2023; Ramesh et al., 2022; Brooks et al., 2024). To maximize the success of these large-scale runs, the trend in the machine learning (ML) community has shifted toward algorithms that are not just performant but also predictable: they scale reliably with more computation and training data, such that downstream performance can be predicted from small-scale experiments, without actually running the large-scale experiment (McCandlish et al., 2018; Kaplan et al., 2020; Hoffmann et al., 2023; Dubey et al., 2024).

1 UC Berkeley 2 University of Warsaw 3 CMU. Correspondence to: Oleh Rybkin <oleh.rybkin@gmail.com>, Aviral Kumar <aviralku@andrew.cmu.edu>.

Proceedings of the 42nd International Conference on Machine Learning, Vancouver, Canada. PMLR 267, 2025. Copyright 2025 by the author(s).

In this paper, we study whether deep reinforcement learning (RL) is also amenable to such scaling and predictability benefits. We focus on value-based methods that train value functions using temporal difference (TD) learning, which are known to be performant at small scales (Mnih et al., 2015; Lillicrap et al., 2016; Haarnoja et al., 2018). Compared to policy gradient (Mnih et al., 2016; Schulman et al., 2017) and search methods (Silver et al., 2016), value-based RL can learn from arbitrary data and requires less sampling or search, which can be inefficient or infeasible for open-world problems where environment interaction is costly.

We study scaling properties by predicting relationships between different resources required for training. Data requirement D is the amount of data needed to attain a certain level of performance. Likewise, compute requirement C refers to the amount of FLOPs or gradient steps needed to attain a certain level of performance. In RL uniquely, performance can be improved by increasing either available data or compute (e.g., training multiple times on the same data), which we capture via a budget requirement that combines data and compute F = C + δ D, where δ is some constant. An additive budget function is useful when the cost of data and compute can be expressed in similar units, such as wall-clock time or required finances.

To establish scaling relationships, we first require a way to predict the best hyperparameter settings at each scale. We find that learning rate η, batch size B, and the updates-to-data (UTD) ratio σ are the most crucial hyperparameters for value-based RL. While supervised learning benefits from abundant theory to establish optimal hyperparameters (Krizhevsky, 2014; McCandlish et al., 2018; Yang et al., 2021), value-based RL often does not satisfy assumptions typical of supervised learning. For example, value-based RL needs to account for the non-i.i.d. nature of training data. Distribution shift due to periodic changes in the data collection policy (Levine et al., 2020) contributes to a form of overfitting where minimizing training TD error may not result in a low TD error under the data distribution induced by the new policy. In addition, objective shift due to changing target values (Dabney et al., 2021) contributes to plasticity loss (D'Oro et al., 2023; Kumar et al., 2021). We show that it is possible to account for the training dynamics unique to value-based RL, and we are able to find the best hyperparameters by scaling the batch size and learning rate inversely with the UTD ratio. We estimate this dependency using a power law (Kaplan et al., 2020), and observe that this model makes effective predictions.

Figure 1: Scaling properties when increasing compute C, data D, budget F, or performance J, shown for DMC and OpenAI Gym. Panels: (I) Compute-Data Pareto frontier; (II) Budget extrapolation; (III) Fits for multiple J. Left: Compute versus data requirements Pareto frontier controlled by the UTD ratio σ. We observe that we can trade off data for compute and vice versa, and this relationship is predictable. Middle: Extrapolation from low to high performance. We observe that the optimal resource allocation controlled by σ evolves predictably with increasing budget, and can be used to extrapolate from low to high performance. Right: Pareto frontiers for several performance levels J.

Using the best predicted hyperparameters, we are now able to establish that data and compute requirements evolve as a predictable function of the UTD ratio σ. Furthermore, σ defines the tradeoff between data and compute, which can be visualized as a Pareto frontier (Figure 1, left). Using this model, we are able to extrapolate the resource requirements from a low-compute to a high-compute setting, as well as from a low-data to a high-data setting, as shown in the figure.

Using the Pareto frontiers, we are now able to extrapolate from low to high performance levels. Instead of extrapolating as a function of return, which can be arbitrary and non-smooth, we extrapolate as a function of the allowed budget F. We can define an optimal tradeoff between data and compute, and we observe that this optimal tradeoff evolves predictably toward higher budgets, which in turn attain higher performance levels (Figure 1, middle). Thus we are able to predict optimal hyperparameters, as well as data and compute allocation, for high-budget runs using only data from low-budget runs.

Our contribution is showing that the behavior of value-based deep RL methods based on TD-learning is predictable in larger data and compute regimes. Specifically, we:

1. establish predictable rules for dependencies between the hyperparameters batch size (B), learning rate (η), and UTD ratio (σ) in value-based RL, and show that these rules enable more effective scaling.

2. show that data and compute required to attain a given performance level lie on a Pareto frontier, and are respectively predictable in the higher-compute or higher-data regimes.

3. show the optimal allocation of budget between data and compute, and predict how such allocation evolves with higher budgets for best performance.

Our findings apply to algorithms such as SAC, BRO, and PQL, and domains such as the DeepMind Control Suite (DMC), OpenAI Gym, and Isaac Gym. The generality of our conclusions challenges conventional wisdom that value-based deep RL does not scale predictably.

2. RL Preliminaries and Notation

We study standard off-policy online RL, which maximizes the agent's return by training on a replay buffer and periodically collecting new data (Sutton and Barto, 2018). Value-based deep RL methods train a Q-network, Qθ, to minimize the temporal difference (TD) error:

$$L(\theta) = \mathbb{E}_{(s, a, s') \sim \mathcal{P},\; a' \sim \pi(\cdot \mid s')} \left[ \left( r(s, a) + \gamma \bar{Q}(s', a') - Q_\theta(s, a) \right)^2 \right], \tag{2.1}$$

where $\mathcal{P}$ is the replay buffer, $\bar{Q}$ is the target Q-network, $s$ denotes a state, and $a'$ is an action drawn from a policy $\pi(\cdot \mid s')$ that aims to maximize Qθ(s, a). We implement this operation by sampling a batch of size B from the buffer and taking a gradient step along the gradient of this loss with a learning rate η. In theory, off-policy algorithms can be made very sample efficient by minimizing the TD error fully over any data batch, which in practice translates to making more update steps to the Q-network per environment step, i.e., a higher updates-to-data (UTD) ratio (Chen et al., 2020). However, increasing the UTD ratio naïvely can lead to worse performance (Nikishin et al., 2022; Janner et al., 2019). Consequently, unlike the standard supervised learning or LLM literature, which considers B and η as the two main hyperparameters affecting training (Kaplan et al., 2020; Hoffmann et al., 2023), our setting presents another hyperparameter, the UTD ratio σ, that we also study in our paper.
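To make the roles of B, η, and σ concrete, the following is a minimal sketch of a TD update loop. It assumes a tabular Q-function stored in a dictionary and a fixed target table, which is a simplification for illustration and not the neural-network setup used in the experiments; the function names `td_loss`, `td_update`, and `train` are hypothetical.

```python
import random

def td_loss(batch, q, q_target, policy, gamma=0.99):
    """Mean squared TD error over a batch of (s, a, r, s_next) transitions."""
    total = 0.0
    for s, a, r, s_next in batch:
        target = r + gamma * q_target[(s_next, policy(s_next))]  # bootstrapped target
        total += (target - q[(s, a)]) ** 2
    return total / len(batch)

def td_update(batch, q, q_target, policy, lr, gamma=0.99):
    """One gradient step on the TD loss; the target table is held fixed."""
    grads = {}
    for s, a, r, s_next in batch:
        target = r + gamma * q_target[(s_next, policy(s_next))]
        # derivative of the batch-mean squared error with respect to Q(s, a)
        grads[(s, a)] = grads.get((s, a), 0.0) - 2.0 * (target - q[(s, a)]) / len(batch)
    for key, grad in grads.items():
        q[key] -= lr * grad

def train(buffer, q, q_target, policy, utd, batch_size, lr, env_steps, gamma=0.99, seed=0):
    """Run `utd` gradient updates per environment step (data collection omitted)."""
    rng = random.Random(seed)
    for _ in range(env_steps):
        # a real agent would act here and append the new transition to `buffer`
        for _ in range(utd):
            batch = [rng.choice(buffer) for _ in range(batch_size)]
            td_update(batch, q, q_target, policy, lr, gamma)
    return q

# Toy usage: one transition, a fixed target table, and a policy that always picks action 0.
buffer = [(0, 0, 1.0, 1)]
q = {(0, 0): 0.0, (1, 0): 0.0}
q_target = {(0, 0): 0.0, (1, 0): 2.0}
policy = lambda s: 0
loss_before = td_loss(buffer, q, q_target, policy, gamma=0.5)
train(buffer, q, q_target, policy, utd=2, batch_size=1, lr=0.25, env_steps=1, gamma=0.5)
```

In this sketch the UTD ratio is simply the number of inner-loop updates per environment step, so raising `utd` increases compute while leaving the amount of collected data unchanged.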

Notation. In this paper, we focus on the following key hyperparameters: the UTD ratio σ, learning rate η, and the batch size B. We will answer questions pertaining to performance of a policy π denoted by J(π), the total data utilized by an algorithm to reach a given target level of performance J (denoted by DJ), and the total compute budget utilized by the algorithm to reach performance J (denoted by CJ), which is measured in terms of FLOPs or wall-clock time taken by the algorithm.

3. Problem Statement and Formulation

To demonstrate that the behavior of value-based RL can be predicted reliably at scale, we first pose multiple resource optimization questions that guide our scaling study. Viewing data and compute as two resources, we answer questions of the form: what is the minimum value of [resource] needed to attain a given target performance? And what should the hyperparameters (e.g., B, η, σ) be in such a training run? We will answer such questions by fitting empirical laws from low data and compute runs to determine relationships between hyperparameters. Doing so, in turn, enables us to determine how to set hyperparameters and allocate resources to maximize performance when provided with a larger data and compute budget. Note that we wish to make these hyperparameter predictions without running the large data and compute budget experiment. While questions of this form have been studied in supervised learning, the answers are different for online RL, because online RL continuously collects its own data, which ties data and compute together in a complex manner and breaks the i.i.d. nature of datapoints.

Concretely, we study three resource optimization questions: (1) maximizing sample efficiency (i.e., minimize the amount of data D to attain a target performance under a given compute budget), (2) conversely, minimizing compute C (e.g., FLOPs or gradient steps, whichever is more appropriate) to attain a given performance given an upper bound on data that can be collected, and (3) maximizing performance given a total bound on data and compute.

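Problem 3.1 (Resource optimization problems). Find the best configuration (B, η, σ) for algorithm Alg that minimizes either the data D or compute C consumed to obtain performance J0:

1. Maximal sample efficiency:

$$(B^*, \eta^*, \sigma^*) := \arg\min_{(B, \eta, \sigma)} D \quad \text{s.t.} \quad J\left(\pi_{\mathrm{Alg}}(B, \eta, \sigma)\right) \geq J_0, \;\; C \leq C_0.$$

2. Maximal compute efficiency:

$$(B^*, \eta^*, \sigma^*) := \arg\min_{(B, \eta, \sigma)} C \quad \text{s.t.} \quad J\left(\pi_{\mathrm{Alg}}(B, \eta, \sigma)\right) \geq J_0, \;\; D \leq D_0.$$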

We solve these problems by fitting empirical models of the minimum data and compute needed to attain a target performance for different values of J0. Doing so allows us to then solve the third setting (3) for maximizing performance given a total budget on data and compute as shown below.

Problem 3.2 (Maximize performance at large data and compute budget). Find the best configuration (B, η, σ) and resource allocations for data D and compute C that enable Alg to maximize performance at budget F0:

$$(B^*, \eta^*, \sigma^*) := \arg\max_{(B, \eta, \sigma)} J\left(\pi_{\mathrm{Alg}}(B, \eta, \sigma)\right) \quad \text{s.t.} \quad C + \delta \cdot D \leq F_0.$$


Figure 2: The data-compute tradeoff on DMC. Left: The minimum required data DJ scales with the UTD σ as a power law. Right: The minimum required compute CJ increases with the UTD σ as a sum of two power laws.

4. Scaling Results For Value-Based Deep RL

We will now present our main results addressing Problem 3.1 under the two settings discussed above. We will then use these results to present results for Problem 3.2. In order to do so, we run several experiments and estimate scaling trends from the results. Although this procedure might appear standard from scaling studies in language modeling, we found that instantiating it for value-based RL requires understanding the interaction of the various hyperparameters appearing in TD updates, and the data and compute efficiency of the algorithm. We will formalize these relationships via empirically estimated laws and show that these laws extrapolate reliably to new settings not used to obtain these empirical laws. Therefore, in this section, we present empirical and conceptual arguments to build functional forms of relationships between different hyperparameters. Before doing so, we provide our answers to Problems 3.1 and 3.2.

4.1. Main Scaling Results

We begin by answering Problem 3.1 where we maximize sample efficiency. We wish to estimate the minimal amount of data DJ needed to attain a given target performance, given an upper bound on compute C ≤ C0. To do so, we fit DJ needed to attain the target performance J = J0 as a function of the UTD ratio σ (Eq. (4.1)). Intuitively, the minimum amount of data needed to attain a given performance is lower when more updates are made per datapoint (i.e., when σ is high), as more value can be derived from the same datapoint. In addition, we would expect that even for the best value of σ, there is a minimum number of datapoints Dmin that are needed to learn, given the intrinsic difficulty of the task at hand. Based on these intuitions, we hypothesize a power law relationship between DJ(σ) and σ, with an offset Dmin and constants αJ and βJ:

$$D_J(\sigma) \approx D_J^{\min} + \left(\frac{\beta_J}{\sigma}\right)^{\alpha_J} \tag{4.1}$$

Empirical fits of DJ and σ on the DMC suite are in Figure 2 and they validate the efficacy of this fit. We also emphasize that the existence of this power law makes DJ predictable, in that we can predict DJ for larger values of σ that fall outside the range of σ values used to get the fit (Figure 6).

Scaling Observation 1: Data Requirements

The amount of data DJ needed to reach a given return target J0 decreases as a predictable function of the UTD σ, and is a power law (Eq. (4.1)).
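As a rough illustration of how such a data law could be fit and extrapolated, the sketch below assumes an offset power law of the form D_min + (β/σ)^α. It swaps in a simpler procedure than a general nonlinear optimizer: a grid search over the offset, with an ordinary least squares line fit in log space for each candidate. The function names and the toy measurements are hypothetical.

```python
import math

def fit_line(xs, ys):
    """Ordinary least squares for y = a + b * x; returns (a, b)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b = sxy / sxx
    return my - b * mx, b

def predict_data(sigma, d_min, alpha, beta):
    """Assumed offset power law: D(sigma) = d_min + (beta / sigma) ** alpha."""
    return d_min + (beta / sigma) ** alpha

def fit_data_law(sigmas, datas, n_grid=300):
    """Grid-search the offset d_min; fit log(D - d_min) linearly in log(sigma)."""
    best = None
    xs = [math.log(s) for s in sigmas]
    for i in range(n_grid):
        d_min = min(datas) * i / n_grid          # candidates in [0, min(D))
        ys = [math.log(d - d_min) for d in datas]
        a, b = fit_line(xs, ys)                  # log(D - d_min) = a + b * log(sigma)
        alpha = -b
        if alpha <= 0:
            continue                             # only decreasing power laws are valid
        beta = math.exp(a / alpha)               # since a = alpha * log(beta)
        err = sum((math.log(predict_data(s, d_min, alpha, beta)) - math.log(d)) ** 2
                  for s, d in zip(sigmas, datas)) / len(datas)   # log-MSE
        if best is None or err < best[0]:
            best = (err, d_min, alpha, beta)
    if best is None:
        raise ValueError("no decreasing power law fits the data")
    return best[1:]

# Toy measurements generated from d_min=100, alpha=1, beta=400.
sigmas = [1, 2, 4, 8]
datas = [500.0, 300.0, 200.0, 150.0]
d_min, alpha, beta = fit_data_law(sigmas, datas)
extrapolated = predict_data(16, d_min, alpha, beta)   # UTD outside the fitted range
```

The extrapolation to σ = 16 mirrors the held-out evaluation described in the text: the law is fit on small UTD values and queried at a larger one.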

To answer the optimization questions in Problem 3.1, we also need an expression for required compute until the target return CJ. As σ determines the number of gradient steps per data point, CJ is a function of σ. In particular, total compute is equal to the number of gradient steps taken multiplied by the parameter count of the model. Our study does not optimize over the model size and treats it as a constant. Thus, we write the compute CJ as a function of σ as:

$$C_J(\sigma) \approx 10 \cdot N \cdot B(\sigma) \cdot \sigma \cdot D_J(\sigma) \tag{4.2}$$

where N denotes the model size, B(σ) denotes the best choice of batch size for a given UTD value σ, and other variables follow definitions from before. Note that the additional factor of 10 in Eq. (4.2) accounts for the multiple forward passes through the Q-network needed to compute the value-based RL loss, plus the backward pass (in contrast, the typical multiplier in language modeling is 6; the gap in our setting comes from the additional forward passes). We plot CJ(σ) for different values of σ and J = J0 in Figure 2. Since DJ(σ) is not a constant and itself depends on σ, this relationship between CJ(σ) and σ is not a simple power law, unlike Eq. (4.1). Instead, our derivation in Eq. (A.4) shows that CJ(σ) is given by a sum of two different power laws in σ. Similarly to DJ, we also observe that the compute utilized is a predictable function of σ: we are able to accurately estimate the compute at larger values of σ using the relationship in Eq. (4.2).

Scaling Observation 2: Compute Requirements

The compute CJ to attain a given return target J0 increases as a predictable function of the UTD ratio σ, and is a sum of two power laws (Eq. (4.2)).
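To see why the compute requirement becomes a sum of two power laws, the sketch below evaluates the product 10 · N · B(σ) · σ · DJ(σ) directly and in expanded form. It assumes, purely for illustration, that the batch size follows B(σ) = (β_B/σ)^α_B and that the data law is D_min + (β_J/σ)^α_J; all constants are made up.

```python
import math

def batch_size(sigma, beta_b, alpha_b):
    """Assumed power law for the best batch size."""
    return (beta_b / sigma) ** alpha_b

def data_required(sigma, d_min, beta_j, alpha_j):
    """Assumed offset power law for the data requirement."""
    return d_min + (beta_j / sigma) ** alpha_j

def compute_required(sigma, n_params, beta_b, alpha_b, d_min, beta_j, alpha_j):
    """Compute as the product 10 * N * B(sigma) * sigma * D(sigma)."""
    return (10 * n_params * batch_size(sigma, beta_b, alpha_b)
            * sigma * data_required(sigma, d_min, beta_j, alpha_j))

def compute_two_power_laws(sigma, n_params, beta_b, alpha_b, d_min, beta_j, alpha_j):
    """Same quantity, expanded into two power-law terms in sigma."""
    scale = 10 * n_params * beta_b ** alpha_b
    term_offset = d_min * sigma ** (1 - alpha_b)                      # from the D_min offset
    term_decay = beta_j ** alpha_j * sigma ** (1 - alpha_b - alpha_j)  # from the decaying part
    return scale * (term_offset + term_decay)

params = dict(n_params=1e6, beta_b=256, alpha_b=0.5, d_min=1e5, beta_j=4, alpha_j=1)
direct = compute_required(4, **params)
expanded = compute_two_power_laws(4, **params)
```

Under these assumptions the two terms have exponents 1 − α_B and 1 − α_B − α_J, so they can pull in different directions as σ changes, which is why the curve need not be a single power law.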

We observe that both required compute and data are controlled by the UTD ratio σ, which allows us to define a tradeoff between compute and data controlled by σ. We plot this tradeoff as a curve with compute CJ(σ) on the x-axis and data DJ(σ) on the y-axis in Figure 1 (left). Further, as DJ(σ) is a monotonically decreasing function of σ, this curve defines a Pareto frontier: increasing σ improves data efficiency at the expense of compute, and decreasing σ improves compute efficiency at the expense of data. Interestingly, because the compute law is a sum of two power laws, in many environments there is a minimum σ below which compute efficiency no longer improves, as seen on OpenAI Gym in Figure 1.

Solving for maximal data efficiency (Problem 3.1, (1)). We can now solve Problem 3.1 in setting (1). Our strategy to address setting (1) is to find the largest σ (say σmax) that satisfies the compute constraint CJ(σ) ≤ C0, and then plug this σmax into DJ(σ) to obtain the data estimate. This approach enables us to express DJ directly as a function of the available compute C0, as we calculate in Eq. (4.3). This can be visualized as finding the value DJ corresponding to some value C0 on the Pareto frontier (Figure 1, left).

Solving for maximal compute efficiency (Problem 3.1, (2)). Likewise, the solution in setting (2) can be obtained by finding the smallest value of σ in the range that satisfies the data constraint DJ(σ) ≤ D0, and computing the corresponding value of CJ(σ). This can similarly be visualized on the Pareto frontier (Figure 1, left). We summarize our observations in terms of the following takeaway.

Solving 3.1: The Compute-Data Pareto frontier

The UTD ratio σ defines a Pareto frontier between data and compute requirements, and estimating this frontier yields predictable solutions to resource optimization problems in settings (1) and (2). Theoretically, the optimal data requirement for an available compute budget C0 is:

$$D_J^*(C_0) \approx C_0 \cdot \left(10 \cdot N \cdot B(\sigma^*) \cdot \sigma^*\right)^{-1}. \tag{4.3}$$

The optimal compute requirement for a given data budget D0 is:

$$C_J^*(D_0) \approx 10 \cdot N \cdot B(\sigma^*) \cdot \sigma^* \cdot D_0. \tag{4.4}$$

Above, σ* denotes the minimizing UTD value. Calculation details are in Appendix A.
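The two constrained searches described above can be sketched as a scan over candidate UTD values. The code below is a minimal illustration that takes already-fitted data and compute laws as plain functions; the toy laws and the helper names are assumptions, and the sketch relies on data decreasing and compute increasing with σ.

```python
def max_sample_efficiency(sigmas, data_fn, compute_fn, c0):
    """Setting (1): largest sigma with C(sigma) <= c0, and the data it requires."""
    feasible = [s for s in sigmas if compute_fn(s) <= c0]
    if not feasible:
        return None                      # no UTD value fits the compute budget
    sigma = max(feasible)
    return sigma, data_fn(sigma)

def max_compute_efficiency(sigmas, data_fn, compute_fn, d0):
    """Setting (2): smallest sigma with D(sigma) <= d0, and the compute it requires."""
    feasible = [s for s in sigmas if data_fn(s) <= d0]
    if not feasible:
        return None                      # no UTD value fits the data budget
    sigma = min(feasible)
    return sigma, compute_fn(sigma)

# Toy laws (hypothetical constants): data decreases and compute increases with sigma.
data_fn = lambda s: 100 + 400 / s
compute_fn = lambda s: s * data_fn(s)
grid = [0.5, 1, 2, 4, 8]

setting_1 = max_sample_efficiency(grid, data_fn, compute_fn, c0=700)
setting_2 = max_compute_efficiency(grid, data_fn, compute_fn, d0=250)
```

Each call corresponds to reading one point off the Pareto frontier: a compute bound selects a data requirement, and a data bound selects a compute requirement.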

Maximize return within a budget (Problem 3.2). Finally, we tackle Problem 3.2 in order to extrapolate from low to high return. Here, we do not want to minimize resources, but rather want to maximize performance within a given total budget on data and compute. As discussed in Section 3, we consider budget functions linear in both data and compute, i.e., F = C + δ D, for a given constant δ. Our estimated Pareto frontier in Eq. (4.4) will enable answering this question. To do so, we turn to directly predicting a good UTD value σ*. This UTD value is one that not only leads to maximal performance, but also stays within the total resource budget F0. Once the UTD value has been identified, it prescribes a concrete way to partition the total resource budget into good data and compute requirements using the solutions to Problem 3.1.

We plot the data-compute Pareto frontiers for multiple values of J0 in Figure 3 and in Figure 1 (right), and find that these curves move diagonally to the top-right for larger J0. Intersecting these curves with the iso-budget frontiers over D and C prescribed by the budget function gives us the largest possible J0 for which there is still a (D, C) pair that falls just within the budget F0 but attains performance J0 (see Figure 3 for a worked-out version of this procedure). Since both D and C are explained by σ, we can associate this point with a given σ value. Hence, we can estimate the best value σ*(F0) for a given budget threshold F0. Concretely, we observe a power law between σ*(F0) and F0, with constants βσ and ασ (Eq. (4.5)).

Figure 3: Visualization of the solution to Problem 3.2. Several Pareto frontiers (Figure 1, left) are shown, together with lines of iso-budget F, which define optimal budget points (D*, C*). Corresponding optimal UTD ratios σ* are a predictable function of the budgets F0, with the trend line shown dashed.

Solving 3.2: Maximize Return Given a Budget

The best UTD value σ* that leads to maximal J is a predictable function of the budget F0 over data and compute. This relationship follows a power law and extrapolates to large budgets.

This relationship produces the optimal σ*, and as a result, the optimal data and compute allocations to reliably attain maximum performance. As shown in Figure 1, estimating this law from low-budget experiments is sufficient for predicting good σ values for large-budget runs. These predicted σ*(F0) values extrapolate reliably to budgets outside the range used to fit this law (as shown in Figure 1). This concludes the exposition of our main results.
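One way to picture the budget-optimal point on each frontier is as the UTD value that minimizes F = C + δ · D along that frontier. The sketch below computes this point for two hypothetical performance levels; the frontier shapes, the constants, and the helper names are assumptions chosen only to make the arithmetic easy to follow.

```python
def make_frontier(d_min, k):
    """Toy frontier for one performance level: D = d_min + k / sigma, C = sigma * D."""
    data_fn = lambda s: d_min + k / s
    compute_fn = lambda s: s * data_fn(s)
    return data_fn, compute_fn

def budget(sigma, data_fn, compute_fn, delta):
    """Additive budget F = C + delta * D at a given UTD value."""
    return compute_fn(sigma) + delta * data_fn(sigma)

def cheapest_sigma(sigmas, data_fn, compute_fn, delta):
    """UTD value on the frontier with the smallest budget, and that budget."""
    best = min(sigmas, key=lambda s: budget(s, data_fn, compute_fn, delta))
    return best, budget(best, data_fn, compute_fn, delta)

grid = [1, 2, 4, 8, 16]
delta = 4
# Two performance levels; the harder one (larger k) needs more data at every sigma.
frontiers = {"J_low": make_frontier(100, 400), "J_high": make_frontier(100, 1600)}
optimal = {name: cheapest_sigma(grid, d_fn, c_fn, delta)
           for name, (d_fn, c_fn) in frontiers.items()}
```

Collecting such (budget, σ) pairs across performance levels gives the points to which a power law in the budget can then be fit, for example with a line fit in log space.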

4.2. Fitting Relationships Between (B, η, σ)

To arrive at these scaling law fits above, we had to set hyperparameters B and η, which we empirically observed to be important. We fit these hyperparameters as a function of σ, the only variable appearing in many of the scaling relationships discussed above. In this section, we will now describe how to estimate good values of B and η in terms of σ. Our analysis here relies crucially on the behavior of TD-learning that is distinct from supervised learning, where the UTD ratio σ does not exist.

To understand relationships between batch size B, learning rate η, and the UTD ratio σ, we ran an extensive grid search. We first attempted to explain the relationship between the B and η values that attain the highest data efficiency (denoted B*, η*) using the standard heuristic in supervised learning: when the batch size is smaller than the critical batch size, B and η are inversely correlated with each other (McCandlish et al., 2018). However, as shown in Figure 5 (right), we find that without including the UTD ratio σ, the best B and η exhibit very weak correlation. Further, the critical batch size (McCandlish et al., 2018) does not correlate with the empirically best batch size, as we show in Appendix F. Instead, surprisingly, we observe a strong correlation between B* and σ, as well as between η* and σ. Since B* and η* exhibit near-zero correlation with each other, we can omit their mutual dependency and model them independently as functions of the UTD ratio σ. We conceptually explain the relationships between B* and σ, and between η* and σ, below and show that models developed from this understanding enable us to reliably predict good values of B and η, allowing us to fully answer Problem 3.1.

Figure 4: Hyperparameter effects in supervised learning and TD learning on DMC. Panels: (I) Hparam choice for SL vs RL; (II) Effect of UTD ratio σ; (III) Effect of B and η. Top: Overfitting increases with UTD while batch size can be used to counteract it. Bottom: Higher UTD leads to poor training dynamics and plasticity loss (D'Oro et al., 2023). Lower learning rates can be used to counteract it. While these relationships are not perfectly predictable, we use them to inform our design choices.

Predicting best choice of B in terms of σ. Our proposed functional form for the best batch size B* is a power law in σ, which we also empirically validate in Figure 5 (left). We posit this form because, intuitively, large batch sizes increase the risk of overfitting because they lead to repetitive training on a fixed set of data. Furthermore, a small training loss on the distribution of data in the buffer does not necessarily reflect the behavior policy distribution of a learning agent (Levine et al., 2020). This means that minimizing the training loss to a large extent can result in poor test performance J(π), as also seen in prior work (Li et al., 2023a; Nauman et al., 2024a). One way to counteract this form of overfitting from a high UTD value σ is to reduce the batch size so that the training process sees a given sample fewer times. In fact, for a fixed UTD value σ, we empirically validate the hypothesis that a lower B leads to substantially reduced overfitting on several tasks in Figure 4. Hence, we posit an inverse relationship between the best batch size B* and the UTD value σ. We show in Figure 5 that this inverse relationship can indeed be estimated well by a power law in σ (Eq. (4.6)).

Predicting best choice of learning rate η as a function of σ. Next we turn to understanding the relationship between η and σ. We start from a simple observation: a very large σ typically leads to worse performance not only due to overfitting but also due to plasticity loss (Kumar et al., 2021; D'Oro et al., 2023; Lyle et al., 2023), defined broadly as the inability of the value network to fit TD targets appearing later in training. Prior work states that plasticity loss is inherently related to the number of gradient steps performed and claims that larger norms of parameters of the Q-network are indicative of plasticity loss (D'Oro et al., 2023; Lyle et al., 2023). We would expect a larger learning rate to make higher-magnitude updates against the same TD target, and hence move parameters to a state that suffers from difficulty in fitting subsequent targets (Dabney et al., 2021; Lee et al., 2024). As shown in Figure 4, the parameter norm indeed increases with a high learning rate. Therefore, given a UTD value σ, we hypothesize that the best choice of learning rate η*(σ) for a given performance should scale inversely with σ. Empirically we observe that this is indeed the case (Figure 5 (middle)), and we model this relationship as a power law in σ (Eq. (4.7)).

Figure 5: Left, middle: Fitting the best learning rate η and batch size B given UTD σ on DMC. Modeling the dependency on σ is crucial to obtain good hyperparameters, whereas using constant B, η as is commonly done leads to poor extrapolation. Right: the best learning rate and batch size are not significantly correlated, a major difference from supervised learning.

Scaling Observation 3: Hyperparameter Selection

The best choices for the batch size and learning rate are predictable functions of the UTD σ, and both of these relationships follow a power law.

4.3. Empirical Workflow

Fitting Empirical Relationships

1. Run a sweep for batch size B and learning rate η for several values of UTD σ. Since the best batch size and learning rate are independent of each other for a given σ, we can run these sweeps independently.

2. Estimate the empirically best batch size B* and learning rate η*, with statistical bootstrapping.

3. Fit B*(σ) and η*(σ) to the estimated B*, η* according to Equations (4.6) and (4.7).

4. Using the found fits B*(σ), η*(σ), run different values of σ that cover a range spanning an order of magnitude; we use 16×, i.e., σmax/σmin > 16.

5. Fit DJ(σ) according to Eq. (4.1).

6. Using fits of DJ(σ) for different values of J0, fit σ*(F0) according to Eq. (4.5).

7. Optimal hyperparameters can now be extrapolated to larger data, larger compute, or larger budget settings according to Problem 3.1.

Having presented solutions to Problems 3.1 and 3.2, we now present the workflow we utilize to estimate these empirical fits. Further details are in Section 5 and Appendix D. This workflow can serve as a useful skeleton for scaling law studies with other value-based algorithms as well.
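The bootstrapped estimate in step 2 of the workflow could look roughly like the following sketch. The exact procedure is not spelled out here, so everything below is an assumption: seeds are resampled with replacement, the hyperparameter value with the lowest mean data-to-threshold wins each resample, and the winners are averaged in log space. The function name and the toy results are hypothetical.

```python
import math
import random

def bootstrap_best(results, n_boot=1000, seed=0):
    """Estimate the best hyperparameter value with bootstrapping over seeds.

    results maps each candidate value (e.g. a batch size) to a list of
    data-to-threshold measurements, one per seed; lower is better.
    Returns the geometric mean of the per-resample winners (an assumed choice).
    """
    rng = random.Random(seed)
    log_sum = 0.0
    for _ in range(n_boot):
        means = {}
        for value, runs in results.items():
            sample = [rng.choice(runs) for _ in runs]   # resample seeds with replacement
            means[value] = sum(sample) / len(sample)
        winner = min(means, key=means.get)              # best value in this resample
        log_sum += math.log(winner)
    return math.exp(log_sum / n_boot)

# Toy sweep for one UTD value: batch size 256 is clearly best for every seed.
sweep = {128: [10.0, 11.0, 12.0], 256: [5.0, 6.0, 7.0], 512: [20.0, 21.0, 22.0]}
best_batch = bootstrap_best(sweep)
```

When several candidates perform similarly, the resampled winners vary, and the averaged estimate falls between grid points, which gives a smoother target for the subsequent power-law fits than a single argmin over a coarse grid.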

4.4. Evaluating Extrapolation

Evaluating budget extrapolation. Results on all environments are shown in Figure 1 (middle). We estimate several Pareto frontiers corresponding to points with equal changes in budget. We perform the σ*(F0) fit while holding out the two largest budgets. The quality of our fit for these two extrapolated budgets can be seen in the figure.

Evaluating Pareto frontier extrapolation. Results on OpenAI Gym are shown in Figure 6. We fit the data efficiency equation DJ(σ) (Eq. (4.1)) while holding out either the two UTD values σ with the largest data requirement (left) or the two σ values with the largest compute requirement (middle). The quality of our fit for these two extrapolated σ values can be seen in the figure.

Hyperparameter fit extrapolation. Results on OpenAI Gym are shown in Figure 6 (right). We plot the data efficiency fit when using hyperparameters according to our found dependency B*(σ), η*(σ) (shown in red). These fits are estimated from σ = 1, ..., 8 and extrapolated to σ = 0.5. We compare against the typical approach of tuning hyperparameters in online RL, where hyperparameters are tuned for one setting of σ = 2 and this setting is used for all UTD values (shown in blue). We see that our proposed hyperparameter fits improve results for values other than σ = 2. Further, this improvement is larger for larger values of σ, showing that accounting for hyperparameter dependency is critical.

5. Experimental Details

Experimental Setup. We focus on 12 tasks from 3 domains in our study. On OpenAI Gym (Brockman et al., 2016), we use Soft Actor-Critic (SAC), a commonly used TD-learning algorithm (Haarnoja et al., 2018). We first run a sweep over 5 values of η, then a grid of runs with 4 values of σ and 3 values of B, and then use hyperparameter fits to run 2 more values of σ, with 8 seeds per task. To test our approach with larger models, we use DMC (Tassa et al., 2018), where we utilize the state-of-the-art Bigger, Regularized, Optimistic (BRO) algorithm (Nauman et al., 2024b), which uses a larger and more modern architecture. We first run 5 values of B, 4 values of η, and 4 values of σ, and then use hyperparameter fits to run 2 more values of σ, with 10 seeds per task. Finally, we test our approach with more data on Isaac Gym (Makoviychuk et al., 2021), where we use the Parallel Q-Learning (PQL) algorithm (Li et al., 2023b), which was designed to leverage massively parallel simulators like Isaac Gym that can quickly produce billions of environment samples. Because of the computational expense, we only run one Isaac Gym task. We first run 4 values of σ, 3 values of η, and 5 values of B, with 5 seeds per task, after which we run a second round of grid search with 7 values of σ. Further details are in Appendices B and D and Table 3.

Figure 6: Extrapolation towards unseen values of σ on OpenAI Gym. Axes: UTD ratio σ (0.5 to 8) versus data until J, DJ; legend: empirical value, our fit DJ(σ), constant fit DJ(σ). Left: We show Pareto frontier extrapolation towards the higher data regime. Middle: We show Pareto frontier extrapolation towards the higher compute regime. Right: We compare the best-performing hyperparameters (red) for σ = 2 to hyperparameters predicted via our proposed workflow (blue).

Fitting Functional Forms for Scaling Laws. We fit Eq. (4.1) via brute-force search followed by L-BFGS with a log-MSE loss, following Hoffmann et al. (2023). For Equations (4.6) and (4.7), we fit a line in log space using least squares regression, following Kaplan et al. (2020). In our experiments, we run a single fit that is shared across different tasks in a given benchmark. Specifically, we share the slopes αB, αη and use task-specific intercepts $\sigma_B^{\mathrm{env}}$, $\sigma_\eta^{\mathrm{env}}$ (as defined in Equations (4.6) and (4.7)). This technique is standard in ordinary least squares modeling and is referred to as fixed effect regression (Bishop and Nasrabadi, 2006). Sharing the slope serves the goal of variance reduction, which can be important if the grid search over the various hyperparameters is coarse. More details are in Appendices B and D.
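The shared-slope fit described above can be sketched with a within-task (demeaned) least squares estimator, which gives the same slope as an ordinary least squares fit with one intercept per task. The sketch assumes the relationship is a straight line in log-log space; the function names and the toy data are hypothetical.

```python
import math

def fit_shared_slope(points_by_task):
    """Fit log(value) = intercept[task] + slope * log(sigma) by least squares.

    points_by_task maps a task name to a list of (sigma, best_value) pairs.
    The slope is shared across tasks; each task gets its own intercept.
    """
    sxx = sxy = 0.0
    means = {}
    for task, points in points_by_task.items():
        xs = [math.log(sigma) for sigma, _ in points]
        ys = [math.log(value) for _, value in points]
        mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
        means[task] = (mx, my)
        # accumulate within-task (demeaned) sums so intercepts drop out of the slope
        sxx += sum((x - mx) ** 2 for x in xs)
        sxy += sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercepts = {task: my - slope * mx for task, (mx, my) in means.items()}
    return slope, intercepts

def predict(sigma, slope, intercept):
    """Evaluate the fitted power law at a new UTD value."""
    return math.exp(intercept + slope * math.log(sigma))

# Toy best batch sizes for two tasks, both decaying as sigma ** -0.5.
observed = {
    "task_a": [(1, 512.0), (4, 256.0)],
    "task_b": [(1, 128.0), (4, 64.0)],
}
slope, intercepts = fit_shared_slope(observed)
extrapolated_a = predict(16, slope, intercepts["task_a"])
extrapolated_b = predict(16, slope, intercepts["task_b"])
```

Pooling the slope means each task only needs enough points to pin down its own intercept, which is the variance reduction the text refers to when the per-task grid is coarse.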

6. Related Work

Scaling laws and predictability. Prior work has studied scaling laws in the context of supervised learning (Kaplan et al., 2020; Hoffmann et al., 2023), primarily to predict the effect of model size and training data on validation loss, while marginalizing out hyperparameters like batch size (McCandlish et al., 2018) and learning rate (Kaplan et al., 2020). There are several extensions of such scaling laws for language models, such as laws for settings with data repetition (Muennighoff et al., 2023) or mixture-of-experts (Ludziejewski et al., 2024), but most focus on cross-entropy loss, with the exception of Gadre et al. (2024), which focuses on downstream metrics. While scaling laws have guided supervised learning experiments, little work explores them for RL. The closest works are: Hilton et al. (2023), which fits power laws for on-policy RL methods using model size and the number of environment steps; Springenberg et al. (2024), who study model size scaling for offline RL; Jones (2021), which studies the scaling of AlphaZero on board games of increasing complexity; and Gao et al. (2023), which studies reward model overoptimization in RLHF. In contrast, we are the first to study the predictability of off-policy value-based RL methods that are trained via TD-learning. Not only do off-policy methods exhibit training dynamics distinct from supervised learning and on-policy methods (Kumar et al., 2022; Lyle et al., 2023), but we show that this distinction also results in a different functional form for the scaling law altogether. We also note that while Hilton et al. (2023) use minimal compute, i.e., $C_J$ in our notation, as a metric of performance, our analysis goes further in several respects: (1) we also study the tradeoff between data and compute (Figure 1); (2) we can predict the algorithm configuration for best performance (Problem 3.1); (3) we study many budget functions (C + δ · D can be any affine function).

Methods for large-scale deep RL. Recent work has scaled deep RL across three axes: model size (Kumar et al., 2023; Schwarzer et al., 2023; Nauman et al., 2024b), data (Kumar et al., 2023; Gallici et al., 2024; Singla et al., 2024), and UTD (Chen et al., 2020; D'Oro et al., 2023). Naïve scaling of model size or UTD often degrades performance or causes divergence (Nikishin et al., 2022; Schwarzer et al., 2023), which can be mitigated by classification losses (Kumar et al., 2023), layer normalization (Nauman et al., 2024a), or feature normalization (Kumar et al., 2022). In our work, we use scaled network architectures from Nauman et al. (2024b) (Section 5). In on-policy RL, prior works focus on effective learning from parallelized data streams in a simulator or a world model (Mnih et al., 2016; Silver et al., 2016; Schrittwieser et al., 2020). Follow-up works like IMPALA (Espeholt et al., 2018) and SAPG (Singla et al., 2024) use a centralized learner that collects experience from distributed workers with importance sampling updates. These works differ substantially from our study, as we focus exclusively on value-based off-policy RL algorithms that use TD-learning and not on on-policy methods. In value-based RL, prior work on data scaling focuses on offline (Yu et al., 2022; Kumar et al., 2023; Park et al., 2024) and multi-task RL (Hafner et al., 2023). In contrast, we study online RL and fit scaling laws to answer resource optimization questions.

7. Discussion, Limitations, and Future Work

In this paper, we show that value-based deep RL algorithms scale predictably. We establish relationships between good values of hyperparameters of value-based RL. We then establish a relationship between required data and required compute for a certain performance. Finally, this allows us to determine an optimal allocation of resources between data and compute. Although only estimated from small-scale runs, our empirical models reliably extrapolate to large compute, data, budget, or performance regimes. Despite folk wisdom to the contrary, we show it is possible to predict the behavior of value-based off-policy RL algorithms at larger scale using small-scale experiments.

At the same time, this first study also presents a number of open questions and challenges:

1. While simple power law models work well, an open question remains as to whether such laws are theoretically grounded, and whether there are better and more refined functional forms.

2. Our study only focused on three hyperparameters (B, η, and σ). We do not focus on the optimal tradeoff between model size and UTD, which is important for compute scaling. For data-efficient RL, it is important to analyze the dependency of weight decay and weight reset frequency on UTD, since these are typical tricks employed by many of the most performant methods in the literature.

3. While we focus on online RL, it is important to study the scaling of offline-to-online and offline RL, which will allow direct applications of scaling law findings to large model training.

4. Finally, while we study relatively small models, future work will focus on verifying our results with larger model scales and larger-scale tasks, studying the effect of modern architectures, and covering a larger range of compute scales spanning multiple orders of magnitude.

Our work is only one step in studying scaling laws for value-based RL methods. Further research has the potential to improve our understanding of value-based RL at scale, provide researchers with tools to focus innovation on more important components, and eventually provide guidelines towards scaling value-based RL similarly to scaling enjoyed by other modern deep learning approaches.

Acknowledgements

We would like to thank Zhang-Wei Hong, Amrith Setlur, Rishabh Agarwal, Seohong Park, and Max Simchowitz for feedback on an earlier version of this paper. We would like to thank Andrea Zanette, Seohong Park, Kyle Stachowicz, and Qiyang Li for informative discussions. This research was supported by ONR under N00014-24-12206, N00014-22-1-2773, and ONR DURIP grant, with compute support from the Berkeley Research Compute, Polish high-performance computing infrastructure, PLGrid (HPC Center: ACK Cyfronet AGH), that provided computational resources and support under grant no. PLG/2024/017817. Pieter Abbeel holds concurrent appointments as a Professor at UC Berkeley and as an Amazon Scholar. This work was done at UC Berkeley and CMU, and is not associated with Amazon.

Impact Statement

This paper aims to contribute to the advancement of reinforcement learning. While our work may have various societal implications, none warrant specific emphasis here.

References

Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. GPT-4 technical report. arXiv preprint, 2023.

Richard E Barlow and Hugh D Brunk. The isotonic regression problem and its dual. Journal of the American Statistical Association, 1972.

Christopher M Bishop and Nasser M Nasrabadi. Pattern Recognition and Machine Learning. Springer, 2006.

Greg Brockman, Vicki Cheung, Ludwig Pettersson, Jonas Schneider, John Schulman, Jie Tang, and Wojciech Zaremba. OpenAI Gym, 2016.

Tim Brooks, Bill Peebles, Connor Holmes, Will DePue, Yufei Guo, Li Jing, David Schnurr, Joe Taylor, Troy Luhman, Eric Luhman, Clarence Ng, Ricky Wang, and Aditya Ramesh. Video generation models as world simulators. 2024.

Xinyue Chen, Che Wang, Zijian Zhou, and Keith W Ross. Randomized ensembled double Q-learning: Learning fast without a model. In International Conference on Learning Representations, 2020.

Will Dabney, André Barreto, Mark Rowland, Robert Dadashi, John Quan, Marc G Bellemare, and David Silver. The value-improvement path: Towards better representations for reinforcement learning. In AAAI Conference on Artificial Intelligence, 2021.

Pierluca D'Oro, Max Schwarzer, Evgenii Nikishin, Pierre-Luc Bacon, Marc G Bellemare, and Aaron Courville. Sample-efficient reinforcement learning by breaking the replay ratio barrier. In International Conference on Learning Representations, 2023.

Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Amy Yang, Angela Fan, et al. The Llama 3 herd of models. arXiv preprint, 2024.

Lasse Espeholt, Hubert Soyer, Remi Munos, Karen Simonyan, Volodymir Mnih, Tom Ward, Yotam Doron, Vlad Firoiu, Tim Harley, Iain Dunning, et al. IMPALA: Scalable distributed deep-RL with importance weighted actor-learner architectures. International Conference on Machine Learning, 2018.

Samir Yitzhak Gadre, Georgios Smyrnis, Vaishaal Shankar, Suchin Gururangan, Mitchell Wortsman, Rulin Shao, Jean Mercat, Alex Fang, Jeffrey Li, Sedrick Keh, et al. Language models scale reliably with over-training and on downstream tasks. arXiv preprint, 2024.

Matteo Gallici, Mattie Fellows, Benjamin Ellis, Bartomeu Pou, Ivan Masmitja, Jakob Nicolaus Foerster, and Mario Martin. Simplifying deep temporal difference learning. arXiv preprint, 2024.

Leo Gao, John Schulman, and Jacob Hilton. Scaling laws for reward model overoptimization. In International Conference on Machine Learning, 2023.

Tuomas Haarnoja, Aurick Zhou, Pieter Abbeel, and Sergey Levine. Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. In International Conference on Machine Learning, 2018.

Danijar Hafner, Jurgis Pasukonis, Jimmy Ba, and Timothy Lillicrap. Mastering diverse domains through world models. arXiv preprint, 2023.

Jacob Hilton, Jie Tang, and John Schulman. Scaling laws for single-agent reinforcement learning. arXiv preprint, 2023.

Jordan Hoffmann, Sebastian Borgeaud, Arthur Mensch, Elena Buchatskaya, Trevor Cai, Eliza Rutherford, Diego de Las Casas, Lisa Anne Hendricks, Johannes Welbl, Aidan Clark, et al. Training compute-optimal large language models. Advances in Neural Information Processing Systems, 2023.

Michael Janner, Justin Fu, Marvin Zhang, and Sergey Levine. When to trust your model: Model-based policy optimization. In Advances in Neural Information Processing Systems, 2019.

Andy L. Jones. Scaling scaling laws with board games, 2021.

Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. Scaling laws for neural language models. arXiv preprint, 2020.

Alex Krizhevsky. One weird trick for parallelizing convolutional neural networks. arXiv preprint, 2014.

Aviral Kumar, Rishabh Agarwal, Dibya Ghosh, and Sergey Levine. Implicit under-parameterization inhibits data-efficient deep reinforcement learning. In International Conference on Learning Representations, 2021.

Aviral Kumar, Rishabh Agarwal, Tengyu Ma, Aaron Courville, George Tucker, and Sergey Levine. DR3: Value-based deep reinforcement learning requires explicit regularization. International Conference on Learning Representations, 2022.

Aviral Kumar, Rishabh Agarwal, Xinyang Geng, George Tucker, and Sergey Levine. Offline Q-learning on diverse multi-task data both scales and generalizes. In International Conference on Learning Representations, 2023.

Hojoon Lee, Hanseul Cho, Hyunseung Kim, Daehoon Gwak, Joonkee Kim, Jaegul Choo, Se-Young Yun, and Chulhee Yun. Plastic: Improving input and label plasticity for sample efficient reinforcement learning. Advances in Neural Information Processing Systems, 2024.

Sergey Levine, Aviral Kumar, George Tucker, and Justin Fu. Offline reinforcement learning: Tutorial, review, and perspectives on open problems. arXiv preprint, 2020.

Qiyang Li, Aviral Kumar, Ilya Kostrikov, and Sergey Levine. Efficient deep reinforcement learning requires regulating overfitting. In International Conference on Learning Representations, 2023a.

Zechu Li, Tao Chen, Zhang-Wei Hong, Anurag Ajay, and Pulkit Agrawal. Parallel Q-learning: Scaling off-policy reinforcement learning under massively parallel simulation. In International Conference on Machine Learning, 2023b.

Timothy P Lillicrap, Jonathan J Hunt, Alexander Pritzel, Nicolas Heess, Tom Erez, Yuval Tassa, David Silver, and Daan Wierstra. Continuous control with deep reinforcement learning. International Conference on Learning Representations, 2016.

Jan Ludziejewski, Jakub Krajewski, Kamil Adamczewski, Maciej Pióro, Michał Krutul, Szymon Antoniak, Kamil Ciebiera, Krystian Król, Tomasz Odrzygóźdź, Piotr Sankowski, et al. Scaling laws for fine-grained mixture of experts. In International Conference on Machine Learning, 2024.

Clare Lyle, Zeyu Zheng, Evgenii Nikishin, Bernardo Avila Pires, Razvan Pascanu, and Will Dabney. Understanding plasticity in neural networks. In International Conference on Machine Learning, 2023.

Viktor Makoviychuk, Lukasz Wawrzyniak, Yunrong Guo, Michelle Lu, Kier Storey, Miles Macklin, David Hoeller, Nikita Rudin, Arthur Allshire, Ankur Handa, et al. Isaac Gym: High performance GPU-based physics simulation for robot learning. Advances in Neural Information Processing Systems Datasets and Benchmarks Track, 2021.

Sam McCandlish, Jared Kaplan, Dario Amodei, and OpenAI Dota Team. An empirical model of large-batch training. arXiv preprint, 2018.

Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A Rusu, Joel Veness, Marc G Bellemare, Alex Graves, Martin Riedmiller, Andreas K Fidjeland, Georg Ostrovski, et al. Human-level control through deep reinforcement learning. Nature, 2015.

Volodymyr Mnih, Adria Puigdomenech Badia, Mehdi Mirza, Alex Graves, Timothy Lillicrap, Tim Harley, David Silver, and Koray Kavukcuoglu. Asynchronous methods for deep reinforcement learning. In International Conference on Machine Learning, 2016.

Niklas Muennighoff, Alexander Rush, Boaz Barak, Teven Le Scao, Nouamane Tazi, Aleksandra Piktus, Sampo Pyysalo, Thomas Wolf, and Colin A Raffel. Scaling data-constrained language models. Advances in Neural Information Processing Systems, 2023.

Michal Nauman, Michał Bortkiewicz, Piotr Miłoś, Tomasz Trzcinski, Mateusz Ostaszewski, and Marek Cygan. Overestimation, overfitting, and plasticity in actor-critic: The bitter lesson of reinforcement learning. In International Conference on Machine Learning, 2024a.

Michal Nauman, Mateusz Ostaszewski, Krzysztof Jankowski, Piotr Miłoś, and Marek Cygan. Bigger, regularized, optimistic: Scaling for compute and sample-efficient continuous control. Advances in Neural Information Processing Systems, 2024b.

Evgenii Nikishin, Max Schwarzer, Pierluca D'Oro, Pierre-Luc Bacon, and Aaron Courville. The primacy bias in deep reinforcement learning. In International Conference on Machine Learning, 2022.

Seohong Park, Kevin Frans, Sergey Levine, and Aviral Kumar. Is value learning really the main bottleneck in offline RL? Advances in Neural Information Processing Systems, 2024.

Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, and Mark Chen. Hierarchical text-conditional image generation with CLIP latents. arXiv preprint, 2022.

Julian Schrittwieser, Ioannis Antonoglou, Thomas Hubert, Karen Simonyan, Laurent Sifre, Simon Schmitt, Arthur Guez, Edward Lockhart, Demis Hassabis, Thore Graepel, et al. Mastering Atari, Go, chess and Shogi by planning with a learned model. Nature, 2020.

John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv preprint, 2017.

Max Schwarzer, Johan Samir Obando Ceron, Aaron Courville, Marc G Bellemare, Rishabh Agarwal, and Pablo Samuel Castro. Bigger, better, faster: Human-level Atari with human-level efficiency. In International Conference on Machine Learning, 2023.

David Silver, Aja Huang, Chris J Maddison, Arthur Guez, Laurent Sifre, George Van Den Driessche, Julian Schrittwieser, Ioannis Antonoglou, Veda Panneershelvam, Marc Lanctot, et al. Mastering the game of Go with deep neural networks and tree search. Nature, 2016.

Jayesh Singla, Ananye Agarwal, and Deepak Pathak. SAPG: Split and aggregate policy gradients. International Conference on Machine Learning, 2024.

Jost Tobias Springenberg, Abbas Abdolmaleki, Jingwei Zhang, Oliver Groth, Michael Bloesch, Thomas Lampe, Philemon Brakel, Sarah Bechtle, Steven Kapturowski, Roland Hafner, et al. Offline actor-critic reinforcement learning scales to large models. International Conference on Machine Learning, 2024.

Richard S Sutton and Andrew G Barto. Reinforcement Learning: An Introduction. MIT Press, 2018.

Yuval Tassa, Yotam Doron, Alistair Muldal, Tom Erez, Yazhe Li, Diego de Las Casas, David Budden, Abbas Abdolmaleki, Josh Merel, Andrew Lefrancq, et al. DeepMind control suite. arXiv preprint, 2018.

Gemini Team, Rohan Anil, Sebastian Borgeaud, Jean Baptiste Alayrac, Jiahui Yu, Radu Soricut, Johan Schalkwyk, Andrew M Dai, Anja Hauth, Katie Millican, et al. Gemini: A family of highly capable multimodal models. arXiv preprint, 2023.

Saran Tunyasuvunakool, Alistair Muldal, Yotam Doron, Siqi Liu, Steven Bohez, Josh Merel, Tom Erez, Timothy Lillicrap, Nicolas Heess, and Yuval Tassa. dm_control: Software and tasks for continuous control. Software Impacts, 2020.

Pauli Virtanen, Ralf Gommers, Travis E Oliphant, Matt Haberland, Tyler Reddy, David Cournapeau, Evgeni Burovski, Pearu Peterson, Warren Weckesser, Jonathan Bright, et al. SciPy 1.0: Fundamental algorithms for scientific computing in Python. Nature Methods, 2020.

Greg Yang, Edward J Hu, Igor Babuschkin, Szymon Sidor, Xiaodong Liu, David Farhi, Nick Ryder, Jakub Pachocki, Weizhu Chen, and Jianfeng Gao. Tensor programs V: Tuning large neural networks via zero-shot hyperparameter transfer. Advances in Neural Information Processing Systems, 2021.

Tianhe Yu, Aviral Kumar, Yevgen Chebotar, Karol Hausman, Chelsea Finn, and Sergey Levine. How to leverage unlabeled data in offline reinforcement learning. In International Conference on Machine Learning, 2022.

Value-Based Deep RL Scales Predictably Appendices

A. Additional details on derivations

FLOPs calculation. Recall that the FLOPs for the forward and backward passes are $C_J^{\text{forward}}(\sigma) \approx 2 \cdot N \cdot B(\sigma) \cdot \sigma \cdot D_J(\sigma)$ and $C_J^{\text{backward}}(\sigma) \approx 4 \cdot N \cdot B(\sigma) \cdot \sigma \cdot D_J(\sigma)$, where $N$ is the model size and $\sigma$ denotes the number of gradient steps per environment step. The Q-learning methods used in our study use MLP and ResNet architectures, which are well modeled by this approximation. Assuming, as an approximation, that the actor and critic have the same size, a training iteration of the critic requires three forward passes and one backward pass, totaling $C_J^{\text{critic}}(\sigma) \approx 10 \cdot N \cdot B(\sigma) \cdot \sigma \cdot D_J(\sigma)$ (that is, $3 \cdot 2 + 1 \cdot 4 = 10$). A training iteration of the actor requires two forward and two backward passes, totaling $C_J^{\text{actor}}(\sigma) \approx 12 \cdot N \cdot B(\sigma) \cdot \sigma \cdot D_J(\sigma)$ (that is, $2 \cdot 2 + 2 \cdot 4 = 12$). Here we follow the standard practice of updating the actor every time a new data point is collected, while the critic is updated according to the UTD ratio $\sigma$. We therefore expect the critic to be updated more often than the actor. As such, in this study we assume

$$C_J(\sigma) \approx C_J^{\text{critic}}(\sigma) \approx 10 \cdot N \cdot B(\sigma) \cdot \sigma \cdot D_J(\sigma). \tag{A.1}$$
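As a quick numerical illustration of the FLOPs accounting above, the sketch below evaluates the critic cost for one configuration. The function name `critic_training_flops` and the example batch size, UTD, and environment-step count are assumptions chosen for illustration; only N = 4.92e6 comes from Appendix B.

```python
def critic_training_flops(n_params, batch_size, utd, env_steps):
    """Approximate critic FLOPs: C_J ~= 10 * N * B * sigma * D_J (Eq. A.1)."""
    updates = utd * env_steps                        # sigma * D_J gradient steps
    forward = 2 * n_params * batch_size * updates    # cost of one forward pass per update
    backward = 4 * n_params * batch_size * updates   # cost of one backward pass per update
    return 3 * forward + 1 * backward                # three forward passes, one backward pass


# Hypothetical configuration: DMC model size, made-up B, sigma, and D_J.
flops = critic_training_flops(n_params=4.92e6, batch_size=512, utd=2, env_steps=1e5)
print(f"{flops:.3e} FLOPs")
```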

Compute and sample efficiency. Following Eq. (4.1), the number of data points required to achieve performance J is equal to:

$$D_J(\sigma) \approx D_J^{\min} + \left(\frac{\beta_J}{\sigma}\right)^{\alpha_J}. \tag{A.2}$$

Given the expressions for required data points, practical batch size, and FLOPs (Equations (4.1), (4.6) and (A.1)), we can now derive the expression for the compute required to reach a particular performance, expressed in terms of $\sigma$. First, note that the number of parameter updates is

$$\sigma \cdot D_J(\sigma) \approx \sigma \cdot D_J^{\min} + \frac{\beta_J^{\alpha_J}}{\sigma^{\alpha_J - 1}}. \tag{A.3}$$

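Combining the above and Eq. (4.6) with Eq. (A.1) yields:

$$\begin{aligned}
C_J(\sigma) &\approx 10 \cdot N \cdot B(\sigma) \cdot \left(\sigma \cdot D_J^{\min} + \frac{\beta_J^{\alpha_J}}{\sigma^{\alpha_J - 1}}\right) \\
&\approx 10 \cdot N \cdot \left(\frac{\beta_B}{\sigma}\right)^{\alpha_B} \cdot \left(\sigma \cdot D_J^{\min} + \frac{\beta_J^{\alpha_J}}{\sigma^{\alpha_J - 1}}\right) \\
&\approx 10 \cdot N \cdot \left(\frac{D_J^{\min} \, \beta_B^{\alpha_B}}{\sigma^{\alpha_B - 1}} + \frac{\beta_J^{\alpha_J} \, \beta_B^{\alpha_B}}{\sigma^{\alpha_J + \alpha_B - 1}}\right).
\end{aligned} \tag{A.4}$$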

We observe that the resulting expression is a sum of two power laws in $\sigma$, with exponents $1 - \alpha_B$ and $1 - \alpha_J - \alpha_B$. In practice, one of the power laws will dominate the expression, and a simple mental model is that compute increases with UTD as a power law with an exponent < 1 (see Figure 2).
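To make the decomposition concrete, the following sketch checks numerically that the direct product 10 · N · B(σ) · σ · D_J(σ) matches the sum of two power laws with exponents 1 − α_B and 1 − α_J − α_B. All coefficient values are hypothetical placeholders, not fitted values from the paper, and the batch-size law is taken in the form (β_B/σ)^{α_B} used in this derivation.

```python
import numpy as np

# Hypothetical coefficients, chosen only for illustration.
N = 4.92e6
D_MIN, BETA_J, ALPHA_J = 1e5, 3e7, 0.7
BETA_B, ALPHA_B = 6.5e5, 0.47


def data_required(sigma):
    """D_J(sigma) = D_min + (beta_J / sigma)^alpha_J."""
    return D_MIN + (BETA_J / sigma) ** ALPHA_J


def batch_size(sigma):
    """B(sigma) = (beta_B / sigma)^alpha_B."""
    return (BETA_B / sigma) ** ALPHA_B


def compute_direct(sigma):
    """C_J(sigma) = 10 * N * B(sigma) * sigma * D_J(sigma)."""
    return 10 * N * batch_size(sigma) * sigma * data_required(sigma)


def compute_two_power_laws(sigma):
    """Same quantity written as a sum of two power laws in sigma."""
    first = D_MIN * BETA_B ** ALPHA_B * sigma ** (1 - ALPHA_B)
    second = BETA_J ** ALPHA_J * BETA_B ** ALPHA_B * sigma ** (1 - ALPHA_J - ALPHA_B)
    return 10 * N * (first + second)


sigmas = np.array([0.5, 1.0, 2.0, 4.0, 8.0])
print(np.allclose(compute_direct(sigmas), compute_two_power_laws(sigmas)))
```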

Maximal compute efficiency. Here, we solve the compute optimization problem presented in Section 3. We write the problem:

$$(B^*, \eta^*, \sigma^*) := \arg\min_{(B, \eta, \sigma)} C \quad \text{s.t.} \quad J\left(\pi_{\text{Alg}}(B, \eta, \sigma)\right) \geq J_0, \;\; D \leq D_0. \tag{A.5}$$

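Firstly, we formulate the Lagrangian $\mathcal{L}$:

$$\begin{aligned}
\mathcal{L}(\sigma, \lambda) &= C_J(\sigma) + \lambda \cdot \left(D_J(\sigma) - D_0\right) \\
&= 10 \cdot N \cdot B(\sigma) \cdot \left(\sigma \cdot D_J^{\min} + \frac{\beta_J^{\alpha_J}}{\sigma^{\alpha_J - 1}}\right) + \lambda \cdot \left(D_J^{\min} + \left(\frac{\beta_J}{\sigma}\right)^{\alpha_J} - D_0\right).
\end{aligned} \tag{A.6}$$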

Here, the constraint with respect to performance $J_0$ is upheld through the use of $C_J(\sigma)$ and $D_J(\sigma)$, which are defined such that $J = J_0$. We proceed by calculating the derivative with respect to $\lambda$ to find the minimal $\sigma$ that is able to achieve the desired sample efficiency $D_J$. We denote this optimal UTD as $\sigma^*$:

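$$\frac{\partial \mathcal{L}}{\partial \lambda} = D_J^{\min} + \left(\frac{\beta_J}{\sigma}\right)^{\alpha_J} - D_0 = 0 \implies \sigma^* = \frac{\beta_J}{\left(D_0 - D_J^{\min}\right)^{1/\alpha_J}}. \tag{A.7}$$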

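Then, we substitute $\sigma^*$ into the expression defining compute, and use Eq. (4.6):

$$\begin{aligned}
C_J(\sigma^*) &\approx 10 \cdot N \cdot \frac{\beta_B^{\alpha_B}}{(\sigma^*)^{\alpha_B - 1}} \cdot \left(D_J^{\min} + \frac{\beta_J^{\alpha_J}}{(\sigma^*)^{\alpha_J}}\right) \\
&= 10 \cdot N \cdot \frac{\beta_B^{\alpha_B}}{(\sigma^*)^{\alpha_B - 1}} \cdot \left(D_J^{\min} + \beta_J^{\alpha_J} \cdot \frac{D_0 - D_J^{\min}}{\beta_J^{\alpha_J}}\right) \\
&= 10 \cdot N \cdot \beta_B^{\alpha_B} \cdot (\sigma^*)^{1 - \alpha_B} \cdot D_0.
\end{aligned} \tag{A.8}$$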
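The closed-form solution above can be checked numerically. The sketch below computes σ* for a data budget D_0 and the corresponding compute; the function names and all coefficient values in the usage lines are hypothetical, and the batch-size law is again assumed to be (β_B/σ)^{α_B}.

```python
def optimal_utd_for_data_budget(d0, d_min, beta_j, alpha_j):
    """sigma* = beta_J / (D_0 - D_min)^(1 / alpha_J), as in Eq. (A.7)."""
    if d0 <= d_min:
        raise ValueError("the data budget D_0 must exceed D_min")
    return beta_j / (d0 - d_min) ** (1.0 / alpha_j)


def compute_at_optimum(d0, sigma_star, n_params, beta_b, alpha_b):
    """C_J(sigma*) = 10 * N * beta_B^alpha_B * sigma*^(1 - alpha_B) * D_0 (Eq. A.8)."""
    return 10 * n_params * beta_b ** alpha_b * sigma_star ** (1 - alpha_b) * d0


# Hypothetical coefficients, not fitted values from the paper.
sigma_star = optimal_utd_for_data_budget(d0=3e5, d_min=1e5, beta_j=3e7, alpha_j=0.7)
flops = compute_at_optimum(3e5, sigma_star, n_params=4.92e6, beta_b=6.5e5, alpha_b=0.47)
print(sigma_star, flops)
```

At σ*, the data requirement D_J(σ*) equals the budget D_0 exactly, which is why D_0 appears directly in the final line of the derivation.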

Maximal sample efficiency. Firstly, we note that we treat B(σ) as a constant and do not optimize with respect to it. We start with the problem definition:

$$(B^*, \eta^*, \sigma^*) := \arg\min_{(B, \eta, \sigma)} D \quad \text{s.t.} \quad J\left(\pi_{\text{Alg}}(B, \eta, \sigma)\right) \geq J_0, \;\; C \leq C_0. \tag{A.9}$$

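Similarly to the maximal compute efficiency problem, we formulate the Lagrangian $\mathcal{L}$:

$$\begin{aligned}
\mathcal{L}(\sigma, \lambda) &= D_J(\sigma) + \lambda \cdot \left(C_J(\sigma) - C_0\right) \\
&= D_J^{\min} + \left(\frac{\beta_J}{\sigma}\right)^{\alpha_J} + \lambda \cdot \left(10 \cdot N \cdot B(\sigma) \cdot \sigma \cdot \left(D_J^{\min} + \frac{\beta_J^{\alpha_J}}{\sigma^{\alpha_J}}\right) - C_0\right).
\end{aligned} \tag{A.10}$$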

Again, we uphold the constraint with respect to the performance through the use of DJ(σ) and CJ(σ). We calculate the derivative with respect to λ:

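$$\frac{\partial \mathcal{L}}{\partial \lambda} = 10 \cdot N \cdot B(\sigma) \cdot \sigma \cdot \left(D_J^{\min} + \frac{\beta_J^{\alpha_J}}{\sigma^{\alpha_J}}\right) - C_0 = 0 \implies D_J^{\min} + \frac{\beta_J^{\alpha_J}}{\sigma^{\alpha_J}} = \frac{C_0}{10 \cdot N \cdot B(\sigma) \cdot \sigma} = D_J. \tag{A.11}$$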

Since $D_J$ is monotonic in $\sigma$ and does not model the impact of B on sample efficiency, the optimization problem can be solved via the Weierstrass extreme value theorem. As such, we find the largest $\sigma$ that fulfills the compute constraint, and then find the data requirement for that $\sigma$.
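One simple way to carry out this last step is a scan over a grid of candidate UTD values, keeping the largest one whose compute fits the budget. This is only a sketch under stated assumptions: the grid scan is a stand-in for whatever solver was actually used, the function name is hypothetical, and the batch-size law is assumed to be (β_B/σ)^{α_B}.

```python
import numpy as np


def min_data_under_compute_budget(c0, sigma_grid, n_params, d_min,
                                  beta_j, alpha_j, beta_b, alpha_b):
    """Return (sigma, D_J(sigma)) for the largest grid sigma with C_J(sigma) <= c0.

    Returns None when no candidate sigma satisfies the compute budget.
    """
    sigma_grid = np.asarray(sigma_grid, dtype=float)
    data = d_min + (beta_j / sigma_grid) ** alpha_j      # D_J(sigma)
    batch = (beta_b / sigma_grid) ** alpha_b             # B(sigma)
    compute = 10 * n_params * batch * sigma_grid * data  # C_J(sigma)

    feasible = compute <= c0
    if not feasible.any():
        return None
    sigma_best = float(sigma_grid[feasible].max())
    # D_J decreases with sigma, so the largest feasible sigma needs the least data.
    return sigma_best, float(d_min + (beta_j / sigma_best) ** alpha_j)


# Hypothetical coefficients and budget, for illustration only.
grid = np.geomspace(0.25, 16, 25)
params = dict(n_params=4.92e6, d_min=1e5, beta_j=3e7, alpha_j=0.7,
              beta_b=6.5e5, alpha_b=0.47)
print(min_data_under_compute_budget(5e15, sigma_grid=grid, **params))
```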

B. Experimental details

For our experiments, we use a total of 12 tasks from 3 benchmarks (DeepMind Control (Tunyasuvunakool et al., 2020), Isaac Gym (Makoviychuk et al., 2021), and OpenAI Gym (Brockman et al., 2016)). We list all considered tasks in Table 1.

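Table 1: Tasks used in presented experiments.

| Domain | Task | Optimal π Returns |
|---|---|---|
| DeepMind Control | Cartpole-Swingup | 1000 |
| DeepMind Control | Cheetah-Run | 1000 |
| DeepMind Control | Dog-Stand | 1000 |
| DeepMind Control | Finger-Spin | 1000 |
| DeepMind Control | Humanoid-Stand | 1000 |
| DeepMind Control | Quadruped-Walk | 1000 |
| DeepMind Control | Walker-Walk | 1000 |
| Isaac Gym | Franka-Push | 0.05 |
| OpenAI Gym | HalfCheetah-v4 | 8500 |
| OpenAI Gym | Walker2d-v4 | 4500 |
| OpenAI Gym | Ant-v4 | 6625 |
| OpenAI Gym | Humanoid-v4 | 6125 |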

Figure 1. We use all available UTD values for the fits, which is 6 for DMC, 5 for OAI Gym, and 7 for Isaac Gym. Given the dependency of compute and data on UTD, we plot the resulting curve. We average the data efficiencies across all tasks in each domain, as described in Appendix D. For plots on the left, we use J = 800.

We calculate compute given the model sizes of N = 4.92e6 for DMC, N = 1.5e5 for OAI Gym, and N = 2e6 for Isaac Gym, following standard implementations of the respective algorithms.

For budget extrapolation, we use tradeoff values δ to mimic the wall-clock time of the algorithm. We use δ = 1e10 for DMC, δ = 5e9 for OAI Gym, and δ = 1e4 for Isaac Gym. We exclude runs affected by resets (σ = 8) for DMC since the returns right after the reset are lower, which adds noise to the results.

Figure 2. We use the same data as for DMC in Figure 1 (left).

Figure 3. We use the same data as for DMC in Figure 1 (right).

Figure 4. Left: we show an illustration that reflects our observed empirical results about the dependencies between hyperparameters.

Right, middle: we investigate the correlations between overfitting, parameter norm of the critic network, and σ. We observed the same relationships on all tasks. Here, to avoid clutter, we plot 3 tasks from DMC benchmark: cheetah-run, dog-stand, and quadruped-walk. To measure overfitting, we compare the TD loss calculated on samples randomly sampled from the buffer (corresponding to training data) to TD loss calculated on 16 newest transitions (corresponding to validation data) according to:

$$\text{Overfitting} = \text{TD}_{\text{training}} - \text{TD}_{\text{validation}}. \tag{B.1}$$

We fit the linear curves using ordinary least squares with mean absolute error loss.

Figure 5. In the left and central Figures, we evaluate the B and η models. For each DMC task, we find the best hyperparameters according to our workflow and procedure described in Section 5 and Appendix D. While the intercepts vary across environments, for simplicity we plot data points and fits from all environments in the same figure by shifting them with the corresponding intercept. In the right Figure, we marginalize over σ and visualize best performing pairs of B and η.

Figure 6. Here, we investigate 4 tasks from OpenAI Gym, listed in Table 1, and compare the extrapolation performance of two hyperparameter sets: the best-performing hyperparameters for σ = 1, found by testing 8 different hyperparameter values listed in Table 3 (we refer to this configuration as the baseline); and the hyperparameters predicted by our proposed models of $B^*$ and $\eta^*$. We fit our models using σ ∈ (1, 2, 4, 8) and extrapolate to σ ∈ (0.5, 16). The graph shows the data efficiency at a threshold of 700, normalized according to the procedure in Appendix D.

Figure 7. The goal of the left figure is to visualize the effect of an isotonic regression fit on noisy data. We use the SciPy package (Virtanen et al., 2020) to run the isotonic model. In the right figure, we visualize the process of best hyperparameter selection using bootstrapped confidence intervals. We describe the bootstrapping strategy in Appendix D.

C. Resulting Fits

DMC. Refer to Table 2 for environment-specific values.

$\eta = \beta_\eta \cdot \sigma^{-0.26}$

$B = \beta_B \cdot \sigma^{-0.47}$

DJ = Dmin 1 + σ

σ = 1.4e8 F 0.53 0

OpenAI Gym. Refer to Table 2 for environment-specific values.

$\eta = \beta_\eta \cdot \sigma^{-0.30}$

$B = \beta_B \cdot \sigma^{-0.33}$

DJ = Dmin 1 + σ

σ = 1.4e8 F 0.53 0

Isaac Gym.

η = 8.77 1 + σ 2.57e-3

B = 38.6 1 + σ 1.42e-2

DJ = 6.8e7 1 + σ

σ = 11.3 F 0.57 0

Table 2: Coefficients for DMC and OpenAI Gym fits.

| Domain | Task | βη | βB | Dmin |
|---|---|---|---|---|
| DMC | cartpole-swingup | 7.55e-4 | 538.2 | 2.4e4 |
| DMC | cheetah-run | 6.25e-4 | 564.9 | 3.5e5 |
| DMC | finger-spin | 8.77e-4 | 608.2 | 2.9e4 |
| DMC | humanoid-stand | 3.86e-4 | 451.8 | 3.8e5 |
| DMC | quadruped-walk | 8.46e-4 | 526.4 | 6.2e4 |
| DMC | walker-walk | 9.38e-4 | 313.3 | 3.3e4 |
| OpenAI Gym | Ant-v4 | 1.35e-4 | 447.0 | 2.7e5 |
| OpenAI Gym | HalfCheetah-v4 | 1.86e-3 | 415.4 | 7.8e4 |
| OpenAI Gym | Humanoid-v4 | 1.65e-4 | 351.6 | 1.8e5 |
| OpenAI Gym | Walker2d-v4 | 7.85e-4 | 399.1 | 1.7e5 |

D. Additional details on the fitting procedure

Table 3: Tested configurations.

| Hyperparameters | DeepMind Control | Isaac Gym | OpenAI Gym |
|---|---|---|---|
| Updates-to-data σ | 1, 2, 4, 8 | 1/1024, 1/2048, 1/4096, 1/8192, 1/16384, 1/32768, 1/65536 | 1, 2, 4, 8 |
| Batch size B | 32, 64, 128, 256, 512 | 512, 1024, 2048, 4096, 8192 | 128, 256, 512 |
| Learning rate η | 15e-5, 3e-4, 6e-4, 12e-3 | 1e-4, 2e-4, 3e-4 | 1e-4, 2e-4, 5e-4, 1e-3, 2e-3 |

Preprocessing return values. In order to estimate the fits from our laws, we need to track the data and compute needed by a run to hit a target performance level. Due to stochasticity in both training and evaluation, naïve measurements of this point can exhibit high variance. This in turn would result in low-quality fits for $D_J$ and $C_J$. Thus, we preprocess the return values before estimating the fits by running isotonic regression (Barlow and Brunk, 1972). Isotonic regression transforms the return values into the closest monotonic sequence of values, which can then be used to estimate $D_J$. In general, return values can decrease with more training after reaching a target value, and this results in a large deviation between the isotonic fit and the true return values. The isotonic transformation still suffices for us, as our goal is simply to fit the minimum number of samples or compute needed to attain a target return. Since we can still make reliable predictions that extrapolate to larger scales, the downstream impact of this error is not substantial. We also average across random seeds before running isotonic regression to further reduce noise. We normalize the returns for all environments to be between 0 and 1000 (Table 1 lists pre-normalized returns), and reserve the points of 700 and 800 for budget extrapolation in Figure 1.
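The preprocessing described above can be sketched as follows: average returns over seeds, apply isotonic regression, and read off the first evaluation point at which the monotone fit reaches the target return. The paper reports using SciPy for the isotonic fit; to keep this sketch dependency-light, it uses a hand-written pool-adjacent-violators routine instead, and the function names and toy data are assumptions.

```python
import numpy as np


def isotonic_increasing(y):
    """Least-squares non-decreasing fit to y via pool-adjacent-violators."""
    blocks = []  # each block is [mean, number of pooled points]
    for value in y:
        blocks.append([float(value), 1])
        # Merge the last two blocks while they violate monotonicity.
        while len(blocks) > 1 and blocks[-2][0] > blocks[-1][0]:
            m2, w2 = blocks.pop()
            m1, w1 = blocks.pop()
            blocks.append([(m1 * w1 + m2 * w2) / (w1 + w2), w1 + w2])
    fit = []
    for mean, weight in blocks:
        fit.extend([mean] * weight)
    return fit


def data_to_threshold(env_steps, returns_per_seed, threshold):
    """First env-step count at which the seed-averaged isotonic fit reaches threshold."""
    mean_returns = np.mean(np.asarray(returns_per_seed, dtype=float), axis=0)
    fit = isotonic_increasing(mean_returns)
    for steps, value in zip(env_steps, fit):
        if value >= threshold:
            return steps
    return None  # the target return was never reached


# Toy data: 2 seeds, 5 evaluation points, normalized returns in [0, 1000].
steps = [10000, 20000, 30000, 40000, 50000]
returns = [[0, 400, 900, 600, 1000], [0, 600, 700, 800, 1000]]
print(data_to_threshold(steps, returns, threshold=700))
```

In the toy data the averaged curve dips after its third evaluation; the isotonic fit pools the dip with its neighbor, so a threshold crossing is reported only once the monotone fit supports it.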

Uncertainty-adjusted optimal hyperparameters. While averaging across seeds and applying isotonic regression reduces noise, we observe that the granularity of our grid search on learning rate and batch size limits the precision of the resulting hyperparameter fits B, η. Noise due to random seed generation makes hyperparameter selection harder as some hyperparameters that appear empirically optimal might simply be so due to noise. We observe that we can correct for this precision loss by constructing a more precise estimate of B, η adjusted for this uncertainty. Specifically, we run K = 100 bootstrap estimates by sampling n random seeds with replacement out of the original n random seeds, applying isotonic regression, and selecting the optimal hyperparameters Bk, ηk. We then use the mean of this bootstrapped estimate to improve the precision:

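$$B_{\text{bootstrap}} = \frac{1}{K} \sum_{k=1}^{K} B_k, \qquad \eta_{\text{bootstrap}} = \frac{1}{K} \sum_{k=1}^{K} \eta_k. \tag{D.1}$$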
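A minimal sketch of this bootstrap procedure is given below. It assumes an array layout `returns[config, seed, eval_point]`, selects the configuration that reaches the threshold earliest in each resample, and averages the selected hyperparameter values. The function names, the array layout, and the tie-breaking by first index are assumptions made for this example.

```python
import numpy as np


def isotonic_increasing(y):
    """Least-squares non-decreasing fit to y via pool-adjacent-violators."""
    blocks = []
    for value in y:
        blocks.append([float(value), 1])
        while len(blocks) > 1 and blocks[-2][0] > blocks[-1][0]:
            m2, w2 = blocks.pop()
            m1, w1 = blocks.pop()
            blocks.append([(m1 * w1 + m2 * w2) / (w1 + w2), w1 + w2])
    return [mean for mean, weight in blocks for _ in range(weight)]


def evals_to_threshold(mean_returns, threshold):
    """Index of the first evaluation whose isotonic fit reaches the threshold."""
    for index, value in enumerate(isotonic_increasing(mean_returns)):
        if value >= threshold:
            return index
    return len(mean_returns)  # never reached: treat as worse than any real index


def bootstrap_best_value(returns, candidate_values, threshold, k=100, seed=0):
    """Mean over k bootstrap resamples of the empirically best hyperparameter value."""
    rng = np.random.default_rng(seed)
    returns = np.asarray(returns, dtype=float)  # shape [configs, seeds, evals]
    n_seeds = returns.shape[1]
    picks = []
    for _ in range(k):
        idx = rng.integers(0, n_seeds, size=n_seeds)   # resample seeds with replacement
        mean_curves = returns[:, idx, :].mean(axis=1)  # average the resampled seeds
        cost = [evals_to_threshold(curve, threshold) for curve in mean_curves]
        picks.append(candidate_values[int(np.argmin(cost))])
    return float(np.mean(picks))


# Toy data: 2 batch sizes, 3 identical seeds each, 4 evaluation points.
toy = [[[0, 300, 600, 900]] * 3, [[0, 500, 900, 1000]] * 3]
print(bootstrap_best_value(toy, candidate_values=[128, 256], threshold=800))
```

When seeds disagree about which configuration is best, the bootstrap mean lands between grid values, which is how the procedure recovers precision beyond the grid resolution.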

We have also experimented with more precise laws for learning rate and batch size that include an additive offset:

$$B^*(\sigma) \approx B_{\min} + \frac{\beta_B}{\sigma^{\alpha_B}}, \tag{D.2}$$

$$\eta^*(\sigma) \approx \eta_{\min} + \frac{\beta_\eta}{\sigma^{\alpha_\eta}}. \tag{D.3}$$

In this case, we follow Hoffmann et al. (2023) and fit the data using brute-force search followed by L-BFGS. We use MSE in log space as the error: $\text{MSE}_{\log}(a, b) = (\log a - \log b)^2$.

However, we found that this more complex fit did not justify the loss of degrees of freedom given a limited sweep range, and resulted in lower extrapolation accuracy.

Independence of B and η. Although the optimal choices of B and η are often intertwined as UTD changes, we observe in our experiments that the correlation between them is relatively low (Figure 5). If we run a cross-product grid search over the hyperparameter space $\{B_1, \ldots, B_{n_B}\} \times \{\eta_1, \ldots, \eta_{n_\eta}\}$, we can use this fact to further improve the results by averaging the estimate of $B^*$ over different values of η. That is, we produce the estimate $B^*_{[\eta=\eta_i]}$ (respectively $\eta^*_{[B=B_i]}$) by only looking at the runs where $\eta = \eta_i$ (respectively $B = B_i$), and average such estimates:

$$B^* = \frac{1}{n_\eta} \sum_{i} B^*_{[\eta=\eta_i]}, \qquad \eta^* = \frac{1}{n_B} \sum_{i} \eta^*_{[B=B_i]}. \tag{D.4}$$

Figure 7: Left: Determining performance via isotonic regression on DMC. Right: improving hyperparameter selection with uncertainty adjustment on DMC. Further details are in Appendix D.

Data efficiency. We fit the data efficiency of the runs with our found practical hyperparameters $B^*$, $\eta^*$ according to Eq. (4.1). We follow Hoffmann et al. (2023) and fit the data using brute-force search followed by L-BFGS. We use MSE in log space as the error: $\text{MSE}_{\log}(a, b) = (\log a - \log b)^2$.
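The fitting recipe above (a brute-force grid to find good starting points, then L-BFGS on a log-space MSE) might look as follows. This is a sketch, not the authors' implementation: it uses SciPy's bounded `L-BFGS-B` method, optimizes in log-parameter space to keep all quantities positive, and the grid ranges, bounds, and function names are assumptions.

```python
import itertools

import numpy as np
from scipy.optimize import minimize


def predict_data(theta, sigma):
    """D_J(sigma) = D_min + beta^alpha * sigma^(-alpha), theta holds logs of (D_min, beta^alpha, alpha)."""
    log_d_min, log_scale, log_alpha = theta
    return np.exp(log_d_min) + np.exp(log_scale) * sigma ** (-np.exp(log_alpha))


def log_mse(theta, sigma, data):
    """Mean squared error between log predictions and log observations."""
    return np.mean((np.log(predict_data(theta, sigma)) - np.log(data)) ** 2)


def fit_data_efficiency(sigma, data, n_starts=5):
    sigma = np.asarray(sigma, dtype=float)
    data = np.asarray(data, dtype=float)
    top = np.log(data.max())

    # Brute-force stage: score a coarse grid and keep the best few starting points.
    grid = itertools.product(
        np.linspace(top - 6, top, 7),      # log D_min
        np.linspace(top - 6, top, 7),      # log beta^alpha
        np.log([0.25, 0.5, 1.0, 2.0]),     # log alpha
    )
    starts = sorted(grid, key=lambda th: log_mse(th, sigma, data))[:n_starts]

    # Refinement stage: L-BFGS (bounded variant) from each retained start.
    bounds = [(top - 20, top + 5), (top - 20, top + 5), (np.log(0.05), np.log(5.0))]
    results = [
        minimize(log_mse, x0=np.array(th), args=(sigma, data),
                 method="L-BFGS-B", bounds=bounds)
        for th in starts
    ]
    best = min(results, key=lambda r: r.fun)
    d_min, alpha = np.exp(best.x[0]), np.exp(best.x[2])
    beta = np.exp(best.x[1] / alpha)       # recover beta_J from beta_J^alpha_J
    return d_min, beta, alpha, best


# Synthetic, noise-free data generated from made-up coefficients.
utd = np.array([0.25, 0.5, 1.0, 2.0, 4.0, 8.0])
observed = 1e5 + 2e5 * utd ** -0.7
d_min, beta, alpha, result = fit_data_efficiency(utd, observed)
print(d_min, beta, alpha, result.fun)
```

Multiple restarts matter here because the three parameters trade off against each other over a narrow range of σ, so a single start can stall in a poor local optimum.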

In DeepMind Control Suite, we would like to share the data efficiency fit across different environments env. We normalize the data efficiency D by the intra-environment median data efficiency $D^{env}_{med} = \text{median}\{D^{env}_{[\sigma=\sigma_i]} \mid i = 1..n_\sigma\}$. For interpretability, we further re-normalize D with the overall median $D_{med}$: $D_{norm} = D \cdot D_{med} / D^{env}_{med}$. To do so, we express the data efficiency law alternatively as:

$$D_J(\sigma) \approx D_J^{\min} \left(1 + \left(\frac{\beta_J}{\sigma}\right)^{\alpha_J}\right).$$

This is equivalent to Eq. (4.1) because the coefficient $\beta_J$ absorbs $D_J^{\min}$. However, this expression makes explicit an overall multiplicative offset$^1$ $D_J^{\min}$. Our median normalization is then equivalent to fitting per-environment coefficients $D_J^{\min}$, following our procedure for environment-shared hyperparameter fits. However, we further improve robustness by fixing the per-environment coefficients to be the median data efficiency, so they do not need to be fitted.
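The median normalization can be written in a few lines. In this sketch, the overall median is taken over all pooled values from all environments, which is an assumption since the text does not spell out how $D_{med}$ is computed; the function name and the toy numbers are likewise hypothetical.

```python
import numpy as np


def normalize_data_efficiency(data_by_env):
    """Rescale each environment's D values so its median matches the overall median.

    `data_by_env` maps an environment name to the D values measured at each UTD.
    """
    arrays = {env: np.asarray(d, dtype=float) for env, d in data_by_env.items()}
    # Assumption: the overall median is computed over all pooled values.
    overall_median = float(np.median(np.concatenate(list(arrays.values()))))
    normalized = {
        env: d * overall_median / np.median(d)  # D_norm = D * D_med / D_med^env
        for env, d in arrays.items()
    }
    return normalized, overall_median


# Toy data: two environments with very different data requirements.
toy = {"env_a": [4e4, 3e4, 2.5e4, 2e4], "env_b": [8e5, 6e5, 5e5, 4e5]}
normalized, overall = normalize_data_efficiency(toy)
print(overall, {env: np.median(d) for env, d in normalized.items()})
```

After this rescaling every environment has the same median, so a single curve can be fitted to the pooled points.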

E. Additional experimental results

Table 4: Correlation coefficients for empirically optimal DMC hyperparameters.

| Hyperparameter pair | R |
|---|---|
| learning rate and batch size | 0.04 |
| batch size and UTD | -0.40 |
| learning rate and UTD | -0.46 |

Table 5: Error of Pareto frontier extrapolation.

| Extrapolation | Error |
|---|---|
| toward larger compute | 7.8% |
| toward larger data | 10.6% |

Figure 8: Another example of isotonic regression. Using gaussian smoothing with variance σ = 3 leads to both oversmoothing (right) and undersmoothing (left).

Figure 9: Additional fit results on OpenAI Gym for different values of J.


Figure 10: An approximation of the critical batch size over training. Further details are in Appendix F.

F. Critical batch size analysis

Previous work has argued that there is a critical batch size $B_{\mathrm{crit}}$ for neural network training in image classification, generative modeling, and reinforcement learning with policy gradient algorithms (McCandlish et al., 2018): a transition point beyond which increasing the batch size begins to yield diminishing returns. We follow this work and compute an estimate of the gradient noise scale $B_{\mathrm{noise}} \approx B_{\mathrm{crit}}$ according to the following procedure: throughout training, we compute the gradient norm $|G_B|$ of the critic network for batches of size $B = B_{\mathrm{small}} := 64$ and $B = B_{\mathrm{big}} := 1024$. Then, we evaluate

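$$|G|^2 := \frac{1}{B_{\mathrm{big}} - B_{\mathrm{small}}} \left( B_{\mathrm{big}} |G_{B_{\mathrm{big}}}|^2 - B_{\mathrm{small}} |G_{B_{\mathrm{small}}}|^2 \right)$$

$$S := \frac{1}{1/B_{\mathrm{small}} - 1/B_{\mathrm{big}}} \left( |G_{B_{\mathrm{small}}}|^2 - |G_{B_{\mathrm{big}}}|^2 \right)$$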

and take $B_{\mathrm{crit}} := S/|G|^2$. In practice, to account for the noisiness of $|G|^2$, we first take rolling averages of $|G_{B_{\mathrm{small}}}|$ and $|G_{B_{\mathrm{big}}}|$ over training, and tune the window size so that the estimates for $|G|^2$ and S are stable.
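The two estimators and the ratio $S/|G|^2$ can be sketched as follows. The function names `noise_scale` and `rolling_mean` and the trailing-window smoothing are assumptions for illustration. The synthetic check assumes the standard relation $\mathbb{E}|G_B|^2 = |G|^2 + \mathrm{tr}(\Sigma)/B$ from McCandlish et al. (2018), with made-up values for $|G|^2$ and $\mathrm{tr}(\Sigma)$.

```python
def noise_scale(g_small_sq, g_big_sq, b_small=64, b_big=1024):
    """Estimate (|G|^2, S, B_crit) from squared gradient norms.

    g_small_sq and g_big_sq are (smoothed) squared norms of the gradient
    measured on batches of size b_small and b_big respectively.
    """
    g_sq = (b_big * g_big_sq - b_small * g_small_sq) / (b_big - b_small)
    s = (g_small_sq - g_big_sq) / (1.0 / b_small - 1.0 / b_big)
    return g_sq, s, s / g_sq

def rolling_mean(xs, window):
    """Trailing rolling average; early entries use the available prefix."""
    out = []
    for i in range(len(xs)):
        chunk = xs[max(0, i - window + 1): i + 1]
        out.append(sum(chunk) / len(chunk))
    return out

# Synthetic check: true |G|^2 = 4 and tr(Sigma) = 1000 (made-up values).
true_g_sq, trace_sigma = 4.0, 1000.0
g_sq, s, b_crit = noise_scale(true_g_sq + trace_sigma / 64,
                              true_g_sq + trace_sigma / 1024)
smoothed = rolling_mean([1.0, 2.0, 3.0, 4.0], window=2)
```

On noiseless inputs the estimators recover the true gradient norm and noise trace exactly. With real measurements the difference of two noisy quantities can be unstable, which is why the norms are smoothed first.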

We show the values of $B_{\mathrm{crit}}$ over training in Figure 10. In contrast to what has been reported for policy gradient methods, we find that the critical batch size (averaged over training) has little correlation with the optimal batch size, as shown in Figure 11.

¹ This form enforces that $D^{\min}_J$ is positive.


Figure 11: $B_{\mathrm{final}}$ vs. $B_{\mathrm{crit}}$, grouped by task and UTD.

Table 6: Batch size values predicted by the proposed model on DMC.

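| Task | σ = 0.25 | σ = 0.5 | σ = 1 | σ = 2 | σ = 4 | σ = 8 |
|---|---|---|---|---|---|---|
| cartpole-swingup | 1040 | 752 | 544 | 384 | 288 | 208 |
| cheetah-run | 1088 | 784 | 560 | 400 | 288 | 208 |
| dog-stand | 240 | 176 | 128 | 96 | 64 | 48 |
| finger-spin | 1168 | 848 | 608 | 432 | 320 | 224 |
| humanoid-stand | 864 | 624 | 448 | 320 | 240 | 176 |
| quadruped-walk | 1008 | 736 | 528 | 384 | 272 | 192 |
| walker-walk | 608 | 432 | 320 | 224 | 160 | 112 |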
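One way to inspect how the tabulated batch sizes vary with the UTD ratio σ is to estimate the slope of log batch size against log σ. The sketch below does this for a single row of Table 6. The helper name `loglog_slope` is hypothetical, and treating the relationship as a power law is an assumption of this example rather than something stated in this appendix.

```python
import math

def loglog_slope(xs, ys):
    """Least-squares slope of log(y) against log(x)."""
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in ys]
    mx, my = sum(lx) / len(lx), sum(ly) / len(ly)
    num = sum((a - mx) * (b - my) for a, b in zip(lx, ly))
    den = sum((a - mx) ** 2 for a in lx)
    return num / den

sigmas = [0.25, 0.5, 1, 2, 4, 8]
cartpole_batch = [1040, 752, 544, 384, 288, 208]  # cartpole-swingup row of Table 6
slope = loglog_slope(sigmas, cartpole_batch)
```

A negative slope indicates that the predicted batch size shrinks as the UTD ratio grows. The same helper can be applied to the other rows or to the learning rate tables.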

Table 7: Learning rate values predicted by the proposed model on DMC.

| Task | σ = 0.25 | σ = 0.5 | σ = 1 | σ = 2 | σ = 4 | σ = 8 |
|---|---|---|---|---|---|---|
| cartpole-swingup | .00108 | .000902 | .000755 | .000631 | .000528 | .000442 |
| cheetah-run | .000893 | .000747 | .000625 | .000523 | .000438 | .000366 |
| dog-stand | .000664 | .000555 | .000465 | .000389 | .000325 | .000272 |
| finger-spin | .00125 | .00105 | .000877 | .000734 | .000614 | .000514 |
| humanoid-stand | .000551 | .000461 | .000386 | .000323 | .00027 | .000226 |
| quadruped-walk | .00121 | .00101 | .000846 | .000708 | .000592 | .000496 |
| walker-walk | .00134 | .00112 | .000938 | .000785 | .000657 | .000549 |

Table 8: Batch size values predicted by the proposed model on OpenAI Gym.

| Task | σ = 0.25 | σ = 0.5 | σ = 1 | σ = 2 | σ = 4 | σ = 8 | σ = 16 |
|---|---|---|---|---|---|---|---|
| Ant-v4 | 704 | 560 | 448 | 352 | 288 | 224 | 176 |
| HalfCheetah-v4 | 672 | 528 | 416 | 336 | 256 | 208 | 160 |
| Humanoid-v4 | 560 | 432 | 352 | 272 | 224 | 176 | 144 |
| Walker2d-v4 | 640 | 496 | 400 | 320 | 256 | 192 | 160 |

Table 9: Learning rate values predicted by the proposed model on OpenAI Gym.

| Task | σ = 0.25 | σ = 0.5 | σ = 1 | σ = 2 | σ = 4 | σ = 8 | σ = 16 |
|---|---|---|---|---|---|---|---|
| Ant-v4 | .000206 | .000167 | .000138 | .000109 | .000087 | .000070 | .000060 |
| HalfCheetah-v4 | .002820 | .002280 | .001900 | .001510 | .001210 | .000972 | .000827 |
| Humanoid-v4 | .000251 | .000203 | .000169 | .000134 | .000107 | .000086 | .000073 |
| Walker2d-v4 | .001180 | .000958 | .000806 | .000640 | .000512 | .000412 | .000347 |


Table 10: Batch size values predicted by the proposed model on Isaac Gym.

| Task | σ = 1/65536 | σ = 1/32768 | σ = 1/16384 | σ = 1/8192 | σ = 1/4096 | σ = 1/2048 | σ = 1/1024 |
|---|---|---|---|---|---|---|---|
| Franka-Push | 7927 | 5105 | 3234 | 2030 | 1269 | 791 | 493 |

Table 11: Learning rate values predicted by the proposed model on Isaac Gym.

| Task | σ = 1/65536 | σ = 1/32768 | σ = 1/16384 | σ = 1/8192 | σ = 1/4096 | σ = 1/2048 | σ = 1/1024 |
|---|---|---|---|---|---|---|---|
| Franka-Push | 0.000317 | 0.000265 | 0.000221 | 0.000185 | 0.000154 | 0.000129 | 0.000107 |