Hyperparameter Tuning
CleanRL comes with a simple and practical hyperparameter tuning utility, Tuner.
Get started
Create a file tuner_example.py with the following content:
from cleanrl_utils.tuner import Tuner

tuner = Tuner(
    script="cleanrl/ppo.py",
    metric="charts/episodic_return",
    metric_last_n_average_window=50,
    direction="maximize",
    target_scores={
        "CartPole-v1": [0, 500],
    },
    params_fn=lambda trial: {
        "learning-rate": trial.suggest_loguniform("learning-rate", 0.0003, 0.003),
        "num-minibatches": trial.suggest_categorical("num-minibatches", [1, 2, 4]),
        "update-epochs": trial.suggest_categorical("update-epochs", [1, 2, 4]),
        "num-steps": trial.suggest_categorical("num-steps", [5, 16, 32, 64, 128]),
        "vf-coef": trial.suggest_uniform("vf-coef", 0, 5),
        "max-grad-norm": trial.suggest_uniform("max-grad-norm", 0, 5),
        "total-timesteps": 10000,
        "num-envs": 4,
    },
)
tuner.tune(
    num_trials=100,
    num_seeds=3,
)
Then install the optuna dependencies and run the tuner:
poetry install -E optuna
python tuner_example.py
Here is what happens:

- tuner_example.py launches num_trials=100 trials to find the best single set of hyperparameters for CartPole-v1 in script="cleanrl/ppo.py".
- Each trial samples a set of hyperparameters from params_fn and runs num_seeds=3 experiments with different random seeds, mitigating the impact of randomness on the results.
- In each experiment, tuner_example.py averages the last metric_last_n_average_window=50 reported values of metric="charts/episodic_return" into a number \(x_i\) and calculates a normalized score \(z_i = (x_i - 0) / (500 - 0)\) according to target_scores.
- Each trial then averages the normalized scores \(z_i\) of the num_seeds=3 experiments into a number \(z\), and the tuner optimizes \(z\) according to direction="maximize".
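To make the scoring concrete, here is a minimal sketch of that normalization and averaging. The function and variable names are hypothetical and not part of the Tuner API; the actual implementation lives in cleanrl_utils/tuner.py.

import numpy as np

# Hypothetical illustration of the per-trial objective described above.
def normalized_score(episodic_returns, low=0.0, high=500.0, window=50):
    # Average the last `window` reported charts/episodic_return values ...
    x_i = np.mean(episodic_returns[-window:])
    # ... and normalize it with the [low, high] range from target_scores.
    return (x_i - low) / (high - low)

def trial_objective(per_seed_returns):
    # Average the normalized scores across the num_seeds=3 experiments.
    return float(np.mean([normalized_score(r) for r in per_seed_returns]))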
Visualization
Running python tuner_example.py will create an SQLite database containing all of the hyperparameter trials in ./cleanrl_hpopt.db. We can use optuna-dashboard to visualize the tuning process.
poetry run optuna-dashboard sqlite:///cleanrl_hpopt.db
You can use a different database by passing Tuner(..., storage="mysql://root@localhost/example")
, for example.
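Besides the dashboard, you can inspect the results programmatically with the standard Optuna API. A minimal sketch, assuming the SQLite storage above contains a single study:

import optuna

# study_name=None loads the study only if the storage contains exactly one;
# otherwise pass the study name explicitly.
study = optuna.load_study(study_name=None, storage="sqlite:///cleanrl_hpopt.db")
print(study.best_trial.value)
print(study.best_trial.params)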
Work w/ multiple environments
Tuner supports finding a set of hyperparameters that works well across multiple environments by extending target_scores. In the following example, each trial uses a set of hyperparameters to run experiments with 3 random seeds for each environment in ["CartPole-v1", "Acrobot-v1"], totaling 2*3=6 experiments per trial.
from cleanrl_utils.tuner import Tuner

tuner = Tuner(
    script="cleanrl/ppo.py",
    metric="charts/episodic_return",
    metric_last_n_average_window=50,
    direction="maximize",
    target_scores={
        "CartPole-v1": [0, 500],
        "Acrobot-v1": [-500, 0],
    },
    params_fn=lambda trial: {
        "learning-rate": trial.suggest_loguniform("learning-rate", 0.0003, 0.003),
        "num-minibatches": trial.suggest_categorical("num-minibatches", [1, 2, 4]),
        "update-epochs": trial.suggest_categorical("update-epochs", [1, 2, 4]),
        "num-steps": trial.suggest_categorical("num-steps", [5, 16, 32, 64, 128]),
        "vf-coef": trial.suggest_uniform("vf-coef", 0, 5),
        "max-grad-norm": trial.suggest_uniform("max-grad-norm", 0, 5),
        "total-timesteps": 10000,
        "num-envs": 16,
    },
)
tuner.tune(
    num_trials=100,
    num_seeds=3,
)
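Because each environment has its own [low, high] range in target_scores, the normalized scores remain comparable across environments. Assuming the same averaging as in the single-environment case, a trial's objective with these settings is roughly \(z = \frac{1}{6} \sum_{i=1}^{6} \frac{x_i - \text{low}_{e(i)}}{\text{high}_{e(i)} - \text{low}_{e(i)}}\), where \(e(i)\) denotes the environment of experiment \(i\): the range is \([0, 500]\) for CartPole-v1 and \([-500, 0]\) for Acrobot-v1.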
Info
When optimizing Atari games, you can set target_scores to the random and human scores reported in (Mnih et al., 2015, Extended Data Table 2)1, such as
tuner = Tuner(
    script="cleanrl/ppo_atari.py",
    metric="charts/episodic_return",
    metric_last_n_average_window=50,
    direction="maximize",
    target_scores={
        "Alien-v5": [227.8, 6875],
        "Amidar-v5": [5.8, 1676],
        "Assault-v5": [222.4, 1496],
        "Asterix-v5": [210.0, 8503],
        "Asteroids-v5": [719.1, 13157],
        ...
    },
    ...
)
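With these ranges, the normalized score \(z_i\) from earlier becomes the usual human-normalized score; for Alien-v5, for example, \(z_i = (x_i - 227.8) / (6875 - 227.8)\).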
Work w/ pruners
You can use Tuner
with any pruner from optuna
to prune less promising experiments:
import optuna
from cleanrl_utils.tuner import Tuner

tuner = Tuner(
    script="cleanrl/ppo.py",
    metric="charts/episodic_return",
    metric_last_n_average_window=50,
    direction="maximize",
    target_scores={
        "CartPole-v1": [0, 500],
        "Acrobot-v1": [-500, 0],
    },
    params_fn=lambda trial: {
        "learning-rate": trial.suggest_loguniform("learning-rate", 0.0003, 0.003),
        "num-minibatches": trial.suggest_categorical("num-minibatches", [1, 2, 4]),
        "update-epochs": trial.suggest_categorical("update-epochs", [1, 2, 4]),
        "num-steps": trial.suggest_categorical("num-steps", [5, 16, 32, 64, 128]),
        "vf-coef": trial.suggest_uniform("vf-coef", 0, 5),
        "max-grad-norm": trial.suggest_uniform("max-grad-norm", 0, 5),
        "total-timesteps": 10000,
        "num-envs": 16,
    },
    pruner=optuna.pruners.MedianPruner(n_startup_trials=5),
)
tuner.tune(
    num_trials=100,
    num_seeds=3,
)
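The MedianPruner above is only one option; any standard Optuna pruner can be passed the same way. For example (a sketch using Optuna's built-in pruners, not specific to CleanRL):

import optuna

# Pass one of these as Tuner(..., pruner=...) instead of the MedianPruner above.
percentile = optuna.pruners.PercentilePruner(25.0, n_startup_trials=5)
hyperband = optuna.pruners.HyperbandPruner(min_resource=1, reduction_factor=3)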
Track experiments w/ Weights and Biases
The Tuner can track all the experiments in Weights and Biases, which helps you visualize the progress of the tuning.
import optuna
from cleanrl_utils.tuner import Tuner

tuner = Tuner(
    script="cleanrl/ppo.py",
    metric="charts/episodic_return",
    metric_last_n_average_window=50,
    direction="maximize",
    target_scores={
        "CartPole-v1": [0, 500],
        "Acrobot-v1": [-500, 0],
    },
    params_fn=lambda trial: {
        "learning-rate": trial.suggest_loguniform("learning-rate", 0.0003, 0.003),
        "num-minibatches": trial.suggest_categorical("num-minibatches", [1, 2, 4]),
        "update-epochs": trial.suggest_categorical("update-epochs", [1, 2, 4]),
        "num-steps": trial.suggest_categorical("num-steps", [5, 16, 32, 64, 128]),
        "vf-coef": trial.suggest_uniform("vf-coef", 0, 5),
        "max-grad-norm": trial.suggest_uniform("max-grad-norm", 0, 5),
        "total-timesteps": 10000,
        "num-envs": 16,
    },
    pruner=optuna.pruners.MedianPruner(n_startup_trials=5),
    wandb_kwargs={"project": "cleanrl"},
)
tuner.tune(
    num_trials=100,
    num_seeds=3,
)
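With this configuration, the experiments launched by each trial are tracked under the cleanrl project in Weights and Biases. The wandb_kwargs dictionary is presumably forwarded to the tracking setup, so other options (for example an entity) may work as well; check cleanrl_utils/tuner.py for the exact keys it accepts.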
1. Mnih, V., Kavukcuoglu, K., Silver, D. et al. Human-level control through deep reinforcement learning. Nature 518, 529–533 (2015). https://doi.org/10.1038/nature14236