# Excluding the Irrelevant: Focusing Reinforcement Learning through Continuous Action Masking

## Authors
**Roland Stolz**, **Hanna Krasowski**, Jakob Thumm, Michael Eichelbeck, Philipp Gassert, Matthias Althoff.

## Prerequisites

This benchmark was developed for Python 3.9 on Ubuntu 20.04. We assume that [conda](https://docs.conda.io/projects/conda/en/latest/user-guide/install/index.html) is installed.

## Installation

```bash
sudo apt-get install libgmp3-dev
```


```bash
# Note: Not all required packages are available in conda-forge channels.
conda env create -f environment.yml
conda activate action_masking
pip install -e .
```

## Run training and deployment
Seeker:
```bash
python3 action_masking/experiments/benchmark_seeker.py --approach=masking --masking-mode=ray
```

2D Quadrotor:
```bash
python action_masking/experiments/benchmark_2d_quadrotor.py --approach=masking --masking-mode=ray
```

3D Quadrotor:
```bash
python action_masking/experiments/benchmark_3d_quadrotor.py --approach=masking --masking-mode=ray
```

Set `--approach` to `baseline` or `masking`. If you choose `masking`, also set `--masking-mode` to `generator`, `ray`, or `distribution`.

By default, the benchmark trains `train_iters=10` models and then deploys each model for `n_eval_ep=10` episodes. To change these values, edit the corresponding hyperparameter file in the `hyperparams` folder.

You can visually deploy models on the Seeker environment with the script `action_masking/deploy/deploy_seeker.py`.

### Run all experiments
You can run `./tmux_<environment>_train.sh`, which starts independent and persistent tmux sessions for all experiments of one environment.

For each episode, TensorBoard logs are written to `tensorboard/`. The trained models are saved in `models/`.

## Run hyperparameter optimization
Seeker:
```bash
python3 action_masking/experiments/benchmark_seeker.py --approach=masking --masking-mode=ray --optimize
```
2D Quadrotor:
```bash
python action_masking/experiments/benchmark_2d_quadrotor.py --approach=masking --masking-mode=ray --optimize
```

3D Quadrotor:
```bash
python action_masking/experiments/benchmark_3d_quadrotor.py --approach=masking --masking-mode=ray --optimize
```

Set `--approach` to `baseline` or `masking`. If you choose `masking`, also set `--masking-mode` to `generator`, `ray`, or `distribution`.

### Run all experiments
You can run `bash tmux_<environment>_optimize.sh`, which starts independent and persistent tmux sessions for all hyperparameter optimizations of one environment.

## Structure
```
./
├── hyperparams
├── matlab
├── action_masking
│   ├── benchmark
│   │   ├── benchmark.py
│   │   ├── benchmark_2d_quadrotor.py
│   │   ├── benchmark_3d_quadrotor.py
│   │   ├── benchmark_seeker.py
│   │   ├── plot_optuna.py
│   │   └── run_benchmarks.py
│   ├── callbacks
│   │   ├── deploy_quadrotor_callback.py
│   │   ├── seeker_callback.py
│   │   ├── tdquadrotor_callback.py
│   │   └── train_quadrotor_callback.py
│   ├── deploy
│   │   └── deploy_seeker.py
│   ├── external
│   │   └── stable-baselines3-contrib
│   │       └── sb3_contrib
│   ├── provably_safe_env
│   ├── rl_sampling
│   ├── __init__.py
│   ├── sb3_contrib -> external/stable-baselines3-contrib/sb3_contrib/
│   └── util
│       ├── __init__.py
│       ├── test_limit_function.py
│       ├── zono_safe_input_set.py
│       ├── sets.py
│       ├── safe_region.py
│       ├── tictoc.py
│       └── util.py
├── environment.yml
├── README.md
├── setup.py
├── tmux_3d_quadrotor_train.sh
├── tmux_3d_quadrotor_optimize.sh
├── tmux_2d_quadrotor_train.sh
├── tmux_2d_quadrotor_optimize.sh
├── tmux_seeker_train.sh
└── tmux_seeker_optimize.sh
```
The learning algorithms are adapted from [Stable Baselines3](https://github.com/DLR-RM/stable-baselines3).


## Refactored action masking code

After the initial submission in May 2024, we started to refactor the action masking code into a proper Python package. As of the NeurIPS 2024 submission deadline, the package is not ready for a public release, but we plan to release it in the future. If you need access to the unfinished code, feel free to reach out to roland.stolz@tum.de.


## Walker2D experiment
We used the new software package to conduct the Walker2D experiment, based on the standard [Walker2d-v5 environment](https://gymnasium.farama.org/environments/mujoco/walker2d/) from Gymnasium.
The relevant action set is the unit ball, which we approximate with a zonotope: we compute $36$ evenly spaced vectors around the unit ball with the Gram-Schmidt process and use them as the generators of the zonotope.
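
As a rough illustration of what a generator representation looks like, the sketch below builds a zonotope in two dimensions only. It is not the construction used for the Walker2D experiment: it does not use the Gram-Schmidt process, and the Walker2D action space is higher-dimensional. The function names `disk_zonotope_generators`, `zonotope_point`, and `support` are hypothetical, and scaling the generators so that the zonotope is inscribed in the unit disk is an assumption made for this example.

```python
import math


def disk_zonotope_generators(m):
    """Return m generators with evenly spaced directions in [0, pi).

    Assumption for this sketch: all generators have the same length, chosen
    so that the resulting zonotope (a regular 2m-gon) is inscribed in the
    unit disk, i.e. its vertices lie on the unit circle.
    """
    length = math.sin(math.pi / (2 * m))
    return [
        (length * math.cos(k * math.pi / m), length * math.sin(k * math.pi / m))
        for k in range(m)
    ]


def zonotope_point(center, generators, beta):
    """Map coefficients beta in [-1, 1]^m to the point c + sum_i beta_i * g_i."""
    if len(beta) != len(generators):
        raise ValueError("need one coefficient per generator")
    if any(abs(b) > 1.0 for b in beta):
        raise ValueError("coefficients must lie in [-1, 1]")
    x = center[0] + sum(b * g[0] for b, g in zip(beta, generators))
    y = center[1] + sum(b * g[1] for b, g in zip(beta, generators))
    return (x, y)


def support(center, generators, direction):
    """Support function of the zonotope: c.d + sum_i |g_i.d|."""
    dx, dy = direction
    offset = center[0] * dx + center[1] * dy
    return offset + sum(abs(g[0] * dx + g[1] * dy) for g in generators)


if __name__ == "__main__":
    gens = disk_zonotope_generators(36)
    # Extent of the zonotope along the x-axis; close to 1 for a good fit.
    print(support((0.0, 0.0), gens, (1.0, 0.0)))
```

With `m = 36`, the polygon has 72 edges, and its extent in any unit direction lies between `cos(pi / 72)` (about 0.999) and 1, so the 2D zonotope is a tight inner approximation of the unit disk.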


## Action replacement
For action replacement, we used the implementation provided in this [Code Ocean capsule](https://codeocean.com/capsule/6830740/tree/v1), which is the supplementary material of *Krasowski et al., Provably Safe Reinforcement Learning: Conceptual Analysis, Survey, and Benchmarking, TMLR 2023*.

## License
This project is licensed under the GPL. See the [LICENSE](LICENSE) file for details.
