## What’s Been Implemented?

- **Main script** for launching MPOPPO training on top of PPO: `examples/scripts/mpoppo.py`
- **`MPOPPOTrainer`**: Located in `trl/trainer/mpoppo_trainer.py`, this extends `PPOTrainer` to implement the full MPO procedure as described in the paper.
- **`MPOPPOConfig`**: Defined in `trl/trainer/mpoppo_config.py`, this contains all hyperparameters for MPOPPO training.
- **Processed corpora** for four tasks (essay writing, summarization, ethical reasoning, and mathematical reasoning) are provided in `trl/extras/mpoppo/corpora`.
- **Initial prompts and meta-prompts** for each task are located in `trl/extras/mpoppo/prompts`.
- **LLM-based reward models (RMs)** and **meta-reward models (MRMs)** are implemented in task-specific files under `trl/extras/mpoppo/rm_{task_name}.py`, and dataset loading/processing is handled in `trl/extras/mpoppo/mpoppo_datasets.py`.
- **Utility functions** for MPOPPO training are implemented in `trl/trainer/utils.py`.
- **Additional scripts** for launching remote LLM servers and evaluating trained models are provided in `scripts/mpo_experiments`.
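As a rough illustration of how the pieces above fit together, a training run would start from the main script. The flags below are placeholders, not the script's actual interface; consult `examples/scripts/mpoppo.py` and `MPOPPOConfig` for the real arguments.

```shell
# Hypothetical launch sketch: the task name and output directory flags are
# assumptions for illustration; see examples/scripts/mpoppo.py for the
# arguments MPOPPOConfig actually accepts.
accelerate launch examples/scripts/mpoppo.py \
    --task summarization \
    --output_dir ./checkpoints/mpoppo_summarization
```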

## Installation & Execution Requirements

- Running MPOPPO requires two components:
  1. **A primary node or subset of GPUs** dedicated to RL training.
  2. **A separate node or the remaining GPUs** dedicated to serving reward scores in an online fashion.
- For the former, install this repository using `virtualenv` and `uv` (recommended for clean and reproducible environments):
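  A minimal environment setup along those lines might look as follows; the environment name and Python version are illustrative assumptions, not requirements stated by this repo.

  ```shell
  # Create an isolated environment with virtualenv, then install the repo
  # with uv. The Python version and editable-install path are assumptions.
  virtualenv .venv --python=python3.10
  source .venv/bin/activate
  uv pip install -e .
  ```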
  ```
  $ uv pip install vllm==0.8.4
  ```
- Refer to the [SGLang documentation](https://docs.sglang.ai/) for more details.
- Training start and end notifications are currently sent via [Pushover](https://pushover.net/api). If you do not wish to use this feature, you can simply comment out the relevant lines in the launch script: `examples/scripts/mpoppo.py`.
- The `launch_mpoppo.sh` script in `scripts/mpo_experiments` demonstrates how to train models using MPOPPO with different parameter configurations.
- The `launch_rm_mrm.sh` script in `scripts/mpo_experiments` shows how to instantiate and serve LLMs via SGLang over an SSH connection.
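For reference, serving a model with SGLang is done through its `launch_server` entry point. The model path and port below are placeholders; `launch_rm_mrm.sh` is the authoritative version for this repository, including the SSH plumbing.

```shell
# Serve an LLM-based reward model with SGLang. The model identifier and
# port are placeholder assumptions; substitute the RM/MRM you actually use.
python -m sglang.launch_server \
    --model-path meta-llama/Llama-3.1-8B-Instruct \
    --port 30000
```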

Below is the README from the trl repository.