Hyperparameter Tuning
CleanRL comes with a simple and practical hyperparameter tuning utility, Tuner.
Get started
Create a file named tuner_example.py with the following content:
from cleanrl_utils.tuner import Tuner
tuner = Tuner(
    script="cleanrl/ppo.py",
    metric="charts/episodic_return",
    metric_last_n_average_window=50,
    direction="maximize",
    target_scores={
        "CartPole-v1": [0, 500],
    },
    params_fn=lambda trial: {
        "learning-rate": trial.suggest_loguniform("learning-rate", 0.0003, 0.003),
        "num-minibatches": trial.suggest_categorical("num-minibatches", [1, 2, 4]),
        "update-epochs": trial.suggest_categorical("update-epochs", [1, 2, 4]),
        "num-steps": trial.suggest_categorical("num-steps", [5, 16, 32, 64, 128]),
        "vf-coef": trial.suggest_uniform("vf-coef", 0, 5),
        "max-grad-norm": trial.suggest_uniform("max-grad-norm", 0, 5),
        "total-timesteps": 10000,
        "num-envs": 4,
    },
)
tuner.tune(
    num_trials=100,
    num_seeds=3,
)
Then you can run the tuner with
poetry install --with optuna
python tuner_example.py
Here is what happened:
- tuner_example.py launches num_trials=100 trials to find the best single set of hyperparameters for CartPole-v1 in script="cleanrl/ppo.py".
- Each trial samples a set of hyperparameters from params_fn and runs num_seeds=3 experiments with different random seeds, mitigating the impact of randomness on the results.
- In each experiment, tuner_example.py averages the last metric_last_n_average_window=50 reported values of metric="charts/episodic_return" to a number \(x_i\) and calculates a normalized score \(z_i = (x_i - 0) / (500 - 0)\) according to target_scores.
- Each trial then averages the normalized scores \(z_i\) of the three experiments to a number \(z\), and the tuner optimizes \(z\) according to direction="maximize". The arithmetic is sketched with made-up numbers below.
Visualization
Running python tuner_example.py creates an SQLite database at ./cleanrl_hpopt.db containing all of the hyperparameter trials. We can use optuna-dashboard to visualize the tuning process.
poetry run optuna-dashboard sqlite:///cleanrl_hpopt.db
You can use a different database by passing Tuner(..., storage="mysql://root@localhost/example"), for example.
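If you prefer to inspect results programmatically instead of (or in addition to) the dashboard, optuna can read the same database directly. This is a sketch; the study name is chosen by the Tuner, so we list the studies stored in the database rather than hard-coding one.

import optuna

storage = "sqlite:///cleanrl_hpopt.db"

# The Tuner chooses the study name, so list every study stored in the database.
summaries = optuna.get_all_study_summaries(storage=storage)
for summary in summaries:
    print(summary.study_name, summary.n_trials)

# Load one of the listed studies (here, the last one) and inspect its best trial.
study = optuna.load_study(study_name=summaries[-1].study_name, storage=storage)
print(study.best_trial.params)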
Work w/ multiple environments
Tuner supports finding a set of hyperparameters that works well across multiple environments by extending target_scores. In the following example, each trial uses one set of hyperparameters to run experiments with 3 random seeds for each environment in ["CartPole-v1", "Acrobot-v1"], totaling 2*3=6 experiments per trial.
from cleanrl_utils.tuner import Tuner
tuner = Tuner(
    script="cleanrl/ppo.py",
    metric="charts/episodic_return",
    metric_last_n_average_window=50,
    direction="maximize",
    target_scores={
        "CartPole-v1": [0, 500],
        "Acrobot-v1": [-500, 0],
    },
    params_fn=lambda trial: {
        "learning-rate": trial.suggest_loguniform("learning-rate", 0.0003, 0.003),
        "num-minibatches": trial.suggest_categorical("num-minibatches", [1, 2, 4]),
        "update-epochs": trial.suggest_categorical("update-epochs", [1, 2, 4]),
        "num-steps": trial.suggest_categorical("num-steps", [5, 16, 32, 64, 128]),
        "vf-coef": trial.suggest_uniform("vf-coef", 0, 5),
        "max-grad-norm": trial.suggest_uniform("max-grad-norm", 0, 5),
        "total-timesteps": 10000,
        "num-envs": 16,
    },
)
tuner.tune(
    num_trials=100,
    num_seeds=3,
)
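Assuming each environment's experiments are normalized with its own target_scores entry exactly as in the single-environment case and the six normalized scores are then averaged, a trial's score works out like this (made-up returns, illustrative only):

# Illustrative arithmetic only; the per-(environment, seed) average returns are made up.
target_scores = {"CartPole-v1": [0, 500], "Acrobot-v1": [-500, 0]}
avg_returns = {
    "CartPole-v1": [450.0, 400.0, 425.0],
    "Acrobot-v1": [-100.0, -150.0, -125.0],
}

normalized = []
for env_id, (low, high) in target_scores.items():
    for x_i in avg_returns[env_id]:
        normalized.append((x_i - low) / (high - low))

# Acrobot returns are negative, yet its normalized scores still fall in [0, 1],
# e.g. (-100 - (-500)) / (0 - (-500)) = 0.8.
print(sum(normalized) / len(normalized))  # ≈ 0.8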
Info
When tuning Atari games, you can set target_scores to the random and human scores reported in (Mnih et al., 2015, Extended Data Table 2)1, which makes the tuner optimize human-normalized scores, such as
tuner = Tuner(
    script="cleanrl/ppo_atari.py",
    metric="charts/episodic_return",
    metric_last_n_average_window=50,
    direction="maximize",
    target_scores={
        "Alien-v5": [227.8, 6875],
        "Amidar-v5": [5.8, 1676],
        "Assault-v5": [222.4, 1496],
        "Asterix-v5": [210.0, 8503],
        "Asteroids-v5": [719.1, 13157],
        ...
    },
    ...
)
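With entries of the form [random score, human score], the normalization described earlier amounts to a human-normalized score. A quick sanity check using the Alien-v5 row and a made-up raw return:

# Illustrative only: human-normalized score for a made-up Alien-v5 return.
random_score, human_score = 227.8, 6875
raw_return = 3500.0
print((raw_return - random_score) / (human_score - random_score))  # ≈ 0.49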
Work w/ pruners
You can use Tuner with any pruner from optuna to prune less promising experiments:
import optuna
from cleanrl_utils.tuner import Tuner
tuner = Tuner(
    script="cleanrl/ppo.py",
    metric="charts/episodic_return",
    metric_last_n_average_window=50,
    direction="maximize",
    target_scores={
        "CartPole-v1": [0, 500],
        "Acrobot-v1": [-500, 0],
    },
    params_fn=lambda trial: {
        "learning-rate": trial.suggest_loguniform("learning-rate", 0.0003, 0.003),
        "num-minibatches": trial.suggest_categorical("num-minibatches", [1, 2, 4]),
        "update-epochs": trial.suggest_categorical("update-epochs", [1, 2, 4]),
        "num-steps": trial.suggest_categorical("num-steps", [5, 16, 32, 64, 128]),
        "vf-coef": trial.suggest_uniform("vf-coef", 0, 5),
        "max-grad-norm": trial.suggest_uniform("max-grad-norm", 0, 5),
        "total-timesteps": 10000,
        "num-envs": 16,
    },
    pruner=optuna.pruners.MedianPruner(n_startup_trials=5),
)
tuner.tune(
    num_trials=100,
    num_seeds=3,
)
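Any other optuna pruner should plug in the same way via Tuner(..., pruner=...). For instance, a percentile-based pruner (shown purely as an illustration):

import optuna

# Prune a trial whose intermediate result falls below the 25th percentile of
# previous trials at the same step; never prune the first 5 trials.
pruner = optuna.pruners.PercentilePruner(25.0, n_startup_trials=5)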
Track experiments w/ Weights and Biases
The Tuner can also track all of the experiments with Weights and Biases, helping you visualize the tuning progress.
import optuna
from cleanrl_utils.tuner import Tuner
tuner = Tuner(
    script="cleanrl/ppo.py",
    metric="charts/episodic_return",
    metric_last_n_average_window=50,
    direction="maximize",
    target_scores={
        "CartPole-v1": [0, 500],
        "Acrobot-v1": [-500, 0],
    },
    params_fn=lambda trial: {
        "learning-rate": trial.suggest_loguniform("learning-rate", 0.0003, 0.003),
        "num-minibatches": trial.suggest_categorical("num-minibatches", [1, 2, 4]),
        "update-epochs": trial.suggest_categorical("update-epochs", [1, 2, 4]),
        "num-steps": trial.suggest_categorical("num-steps", [5, 16, 32, 64, 128]),
        "vf-coef": trial.suggest_uniform("vf-coef", 0, 5),
        "max-grad-norm": trial.suggest_uniform("max-grad-norm", 0, 5),
        "total-timesteps": 10000,
        "num-envs": 16,
    },
    pruner=optuna.pruners.MedianPruner(n_startup_trials=5),
    wandb_kwargs={"project": "cleanrl"},
)
tuner.tune(
    num_trials=100,
    num_seeds=3,
)
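The example above only sets project. Assuming wandb_kwargs is forwarded to the Weights and Biases initialization call, other wandb.init() options such as entity or tags may be passed the same way; the values below are hypothetical.

# Assumption: wandb_kwargs is forwarded to wandb.init(), so other options may work too.
wandb_kwargs = {
    "project": "cleanrl",      # as in the example above
    "entity": "my-team",       # hypothetical W&B team or user name
    "tags": ["ppo", "tuner"],  # hypothetical tags for filtering runs in the W&B UI
}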
1. Mnih, V., Kavukcuoglu, K., Silver, D. et al. Human-level control through deep reinforcement learning. Nature 518, 529–533 (2015). https://doi.org/10.1038/nature14236