A Roundup of Multi-Agent Reinforcement Learning Codebases
1. Algorithms
MARLlib provides three types of MARL algorithms as baselines:
1.1 Independent Learning
- IQL
- DDPG
- PG
- A2C
- TRPO
- PPO
1.2 Centralized Critic
- COMA
- MADDPG
- MAA2C
- MAPPO
- MATRPO
- HATRPO
- HAPPO
1.3 Value Decomposition
- VDN
- QMIX
- FACMAC
- VDAC
- VDPPO
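As a rough illustration of what "value decomposition" means for the methods listed above, the sketch below assumes the additive form used by VDN, where the joint value is the sum of per-agent utilities. The utility numbers and the helper name `q_tot` are hypothetical, and this is not code from any of the libraries discussed here. It shows that under an additive decomposition, each agent picking its own best action yields the same joint action as a centralized search.

```python
from itertools import product

# Hypothetical per-agent utilities Q_i(a_i) for one fixed observation.
# The numbers are illustrative only.
agent_qs = [
    [0.2, 1.5, -0.3],  # agent 0, three actions
    [0.9, 0.1],        # agent 1, two actions
]

def q_tot(joint_action):
    """VDN-style joint value: the sum of each agent's chosen utility."""
    return sum(q[a] for q, a in zip(agent_qs, joint_action))

# Decentralized greedy choice: every agent maximizes its own utility.
decentralized = tuple(
    max(range(len(q)), key=q.__getitem__) for q in agent_qs
)

# Centralized greedy choice: brute-force search over all joint actions.
centralized = max(product(*(range(len(q)) for q in agent_qs)), key=q_tot)

print(decentralized, centralized)
```

Other methods in this family (QMIX, for example) replace the plain sum with a learned mixing function, which this sketch does not cover.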
The table below summarizes the characteristics of each algorithm:
| Algorithm | Supported Task Mode | Needs Central Information | Discrete Action | Continuous Action | Learning Category | Type |
|---|---|---|---|---|---|---|
| IQL* | cooperative collaborative competitive mixed | | ✔️ | | Independent Learning | Off Policy |
| PG | cooperative collaborative competitive mixed | | ✔️ | ✔️ | Independent Learning | On Policy |
| A2C | cooperative collaborative competitive mixed | | ✔️ | ✔️ | Independent Learning | On Policy |
| DDPG | cooperative collaborative competitive mixed | | | ✔️ | Independent Learning | Off Policy |
| TRPO | cooperative collaborative competitive mixed | | ✔️ | ✔️ | Independent Learning | On Policy |
| PPO | cooperative collaborative competitive mixed | | ✔️ | ✔️ | Independent Learning | On Policy |
| COMA | cooperative collaborative competitive mixed | ✔️ | ✔️ | | Centralized Critic | On Policy |
| MADDPG | cooperative collaborative competitive mixed | ✔️ | | ✔️ | Centralized Critic | Off Policy |
| MAA2C* | cooperative collaborative competitive mixed | ✔️ | ✔️ | ✔️ | Centralized Critic | On Policy |
| MATRPO* | cooperative collaborative competitive mixed | ✔️ | ✔️ | ✔️ | Centralized Critic | On Policy |
| MAPPO | cooperative collaborative competitive mixed | ✔️ | ✔️ | ✔️ | Centralized Critic | On Policy |
| HATRPO | cooperative | ✔️ | ✔️ | ✔️ | Centralized Critic | On Policy |
| HAPPO | cooperative | ✔️ | ✔️ | ✔️ | Centralized Critic | On Policy |
| VDN | cooperative | | ✔️ | | Value Decomposition | Off Policy |
| QMIX | cooperative | ✔️ | ✔️ | | Value Decomposition | Off Policy |
| FACMAC | cooperative | ✔️ | | ✔️ | Value Decomposition | Off Policy |
| VDAC | cooperative | ✔️ | ✔️ | ✔️ | Value Decomposition | On Policy |
| VDPPO* | cooperative | ✔️ | ✔️ | ✔️ | Value Decomposition | On Policy |
*IQL is the multi-agent version of Q-learning. MAA2C and MATRPO are the centralized-critic versions of A2C and TRPO. VDPPO is the value decomposition version of PPO.
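To make "the multi-agent version of Q-learning" concrete, here is a minimal tabular sketch under the usual reading of independent Q-learning: each agent keeps its own table and applies the standard one-step Q-learning update to its own local experience. The agent names, observations, rewards, and hyperparameters are hypothetical, and none of this is taken from MARLlib's implementation.

```python
from collections import defaultdict

ALPHA, GAMMA = 0.5, 0.9  # illustrative learning rate and discount factor
N_ACTIONS = 2

def make_q_table():
    # Maps an agent's local observation to a list of action values.
    return defaultdict(lambda: [0.0] * N_ACTIONS)

def iql_update(q, obs, action, reward, next_obs):
    """Standard one-step Q-learning update on a single agent's own table."""
    target = reward + GAMMA * max(q[next_obs])
    q[obs][action] += ALPHA * (target - q[obs][action])

# One independent table per agent; no agent uses the others'
# observations or actions.
q_tables = {"agent_0": make_q_table(), "agent_1": make_q_table()}

# A single hypothetical joint transition, split into per-agent pieces:
# (observation, action, reward, next observation).
transition = {
    "agent_0": ("s0", 1, 1.0, "s1"),
    "agent_1": ("t0", 0, 0.0, "t1"),
}
for name, (obs, action, reward, next_obs) in transition.items():
    iql_update(q_tables[name], obs, action, reward, next_obs)

print(q_tables["agent_0"]["s0"], q_tables["agent_1"]["t0"])
```

Because each agent treats the others as part of the environment, nothing in the update needs central information, which is what distinguishes this family from the centralized-critic methods.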
2. Awesome Repos
The table below compares MARLlib with existing libraries.
| Library | Task Mode | Supported Env | Algorithm | Parameter Sharing | Asynchronous Interaction | Framework |
|---|---|---|---|---|---|---|
| PyMARL | cooperative | 1 | Independent Learning(1) + Centralized Critic(1) + Value Decomposition(3) | full-sharing | | * |
| PyMARL2 | cooperative | 1 | Independent Learning(1) + Centralized Critic(1) + Value Decomposition(9) | full-sharing | | PyMARL |
| MARL-Algorithms | cooperative | 1 | CTDE(6) + Communication(1) + Graph(1) + Multi-task(1) | full-sharing | | * |
| EPyMARL | cooperative | 4 | Independent Learning(3) + Value Decomposition(4) + Centralized Critic(2) | full-sharing + non-sharing | | PyMARL |
| MAlib | self-play | 2 + PettingZoo + OpenSpiel | Population-based | full-sharing + group-sharing + non-sharing | ✔️ | * |
| MAPPO Benchmark | cooperative | 4 | MAPPO(1) | full-sharing + non-sharing | ✔️ | pytorch-a2c-ppo-acktr-gail |
| MARLlib | cooperative collaborative competitive mixed | 10 + PettingZoo | Independent Learning(6) + Centralized Critic(7) + Value Decomposition(5) | full-sharing + group-sharing + non-sharing | ✔️ | Ray/RLlib |
3. Some comments
- starry-sky6688
This codebase is simple and easy to pick up, which makes it a good starting point for beginners. It includes IQL, QMIX, VDN, COMA, QTRAN, MAVEN, CommNet, DyMA-CL, G2ANet, and MADDPG.
- pymarl
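The codebase from the Whiteson group at the University of Oxford. The code is highly modular: very usable once you know it, but hard to get started with. It includes QMIX, COMA, VDN, IQL, and QTRAN.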
- pymarl2 (351 stars)
An improved version of pymarl that adds a number of code-level tricks.
- epymarl (139 stars)
An extended version of pymarl that adds IA2C, IPPO, MADDPG, MAA2C, and MAPPO on top of the original.
- shariqiqbal2810 (307 stars)
Code written by the author of MAAC, and quite concise. It includes MADDPG and MAAC.
- marlbenchmark (509 stars / 102 stars)
A codebase from Tsinghua University containing MAPPO, QMIX, VDN, MADDPG, and MATD3. VDN and MATD3 have not been fully tested.
- marllib (70 stars)
A codebase covering most mainstream MARL algorithms, built on Ray's RLlib. It is very well modularized but takes some time to get started with. It includes independent learning (IQL, A2C, DDPG, TRPO, PPO), centralized critic learning (COMA, MADDPG, MAPPO, HATRPO), and value decomposition (QMIX, VDN, FACMAC, VDA2C).
4. Environments
Most of the popular environments in MARL research are supported by MARLlib:
| Env Name | Learning Mode | Observability | Action Space | Observations |
|---|---|---|---|---|
| LBF | cooperative + collaborative | Both | Discrete | Discrete |
| RWARE | cooperative | Partial | Discrete | Discrete |
| MPE | cooperative + collaborative + mixed | Both | Both | Continuous |
| SMAC | cooperative | Partial | Discrete | Continuous |
| MetaDrive | collaborative | Partial | Continuous | Continuous |
| MAgent | collaborative + mixed | Partial | Discrete | Discrete |
| Pommerman | collaborative + competitive + mixed | Both | Discrete | Discrete |
| MAMuJoCo | cooperative | Partial | Continuous | Continuous |
| GRF | collaborative + mixed | Full | Discrete | Continuous |
| Hanabi | cooperative | Partial | Discrete | Discrete |
Each environment has a README file that serves as its instructions, covering environment settings, installation, and some important notes.
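The environment table can be read together with the algorithm table in section 1 to shortlist algorithms for a given environment. The sketch below is one hypothetical way to do that lookup. The dictionaries are transcribed by hand from the two tables, under the reading that a blank action cell means "not supported", and the function name `candidates` is invented for illustration. It only matches task mode and action space; whether a particular pairing actually runs in MARLlib is not something the tables establish.

```python
ALL_MODES = {"cooperative", "collaborative", "competitive", "mixed"}
COOP = {"cooperative"}

# name -> (task modes, handles discrete, handles continuous).
# Hand-transcribed from the algorithm table; treat as an assumption.
ALGORITHMS = {
    "IQL":    (ALL_MODES, True,  False),
    "PG":     (ALL_MODES, True,  True),
    "A2C":    (ALL_MODES, True,  True),
    "DDPG":   (ALL_MODES, False, True),
    "TRPO":   (ALL_MODES, True,  True),
    "PPO":    (ALL_MODES, True,  True),
    "COMA":   (ALL_MODES, True,  False),
    "MADDPG": (ALL_MODES, False, True),
    "MAA2C":  (ALL_MODES, True,  True),
    "MATRPO": (ALL_MODES, True,  True),
    "MAPPO":  (ALL_MODES, True,  True),
    "HATRPO": (COOP, True,  True),
    "HAPPO":  (COOP, True,  True),
    "VDN":    (COOP, True,  False),
    "QMIX":   (COOP, True,  False),
    "FACMAC": (COOP, False, True),
    "VDAC":   (COOP, True,  True),
    "VDPPO":  (COOP, True,  True),
}

# name -> (learning modes, action space), from the environment table.
ENVIRONMENTS = {
    "LBF":       ({"cooperative", "collaborative"}, "Discrete"),
    "RWARE":     ({"cooperative"}, "Discrete"),
    "MPE":       ({"cooperative", "collaborative", "mixed"}, "Both"),
    "SMAC":      ({"cooperative"}, "Discrete"),
    "MetaDrive": ({"collaborative"}, "Continuous"),
    "MAgent":    ({"collaborative", "mixed"}, "Discrete"),
    "Pommerman": ({"collaborative", "competitive", "mixed"}, "Both"),
    "MAMuJoCo":  ({"cooperative"}, "Continuous"),
    "GRF":       ({"collaborative", "mixed"}, "Discrete"),
    "Hanabi":    ({"cooperative"}, "Discrete"),
}

def candidates(env_name):
    """List algorithms whose task mode and action space match the env."""
    env_modes, action_space = ENVIRONMENTS[env_name]
    result = []
    for name, (algo_modes, discrete, continuous) in ALGORITHMS.items():
        if not (algo_modes & env_modes):
            continue  # no task mode in common
        fits = {
            "Discrete": discrete,
            "Continuous": continuous,
            "Both": discrete or continuous,
        }[action_space]
        if fits:
            result.append(name)
    return result

print(candidates("MetaDrive"))
```

For example, MetaDrive is collaborative with continuous actions, so the cooperative-only rows and the discrete-only rows drop out of the shortlist.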