This paper focuses on using intrinsic reward (IR) to improve learning stability in sparse-reward cooperative multi-agent environments by taking into account both individual and team curiosity.
Info
Sparse rewards are in a sense desirable because they make it easier to design a good, aligned reward function. On the other hand, they make reaching the goal much harder for agents, which receive little incentive to explore.
Their approach builds upon the Intrinsic Curiosity Module (ICM/RND comparison).
They propose a two-headed Mixed Curiosity Module (MCM), and each agent is equipped with one. The first head predicts the agent's next observation from its current observation and the action taken, while the second predicts the next joint observation, given the agent's current observation and action as well as the current joint observation and joint action.
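A minimal PyTorch sketch of what one agent's MCM could look like, assuming forward models over raw observation vectors (no encoder or inverse model, consistent with the simplification noted under Experiments), a shared trunk feeding both heads, and one-hot encoded actions. Layer sizes and structural details are guesses, not the paper's specification:

```python
import torch
import torch.nn as nn

class MixedCuriosityModule(nn.Module):
    """Sketch of one agent's two-headed MCM (assumed architecture, not the paper's exact spec)."""

    def __init__(self, obs_dim, act_dim, joint_obs_dim, joint_act_dim, hidden=128):
        super().__init__()
        # Shared trunk over the agent's own observation and action.
        self.trunk = nn.Sequential(nn.Linear(obs_dim + act_dim, hidden), nn.ReLU())
        # Head 1: predict the agent's own next observation.
        self.indiv_head = nn.Linear(hidden, obs_dim)
        # Head 2: predict the next joint observation; the joint observation
        # and joint action are fed to this head only.
        self.joint_head = nn.Sequential(
            nn.Linear(hidden + joint_obs_dim + joint_act_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, joint_obs_dim))

    def forward(self, obs, act, joint_obs, joint_act):
        h = self.trunk(torch.cat([obs, act], dim=-1))
        pred_next_obs = self.indiv_head(h)
        pred_next_joint_obs = self.joint_head(torch.cat([h, joint_obs, joint_act], dim=-1))
        return pred_next_obs, pred_next_joint_obs

    def intrinsic_reward(self, obs, act, joint_obs, joint_act, next_obs, next_joint_obs):
        pred_o, pred_jo = self(obs, act, joint_obs, joint_act)
        indiv_err = ((pred_o - next_obs) ** 2).mean(dim=-1)
        joint_err = ((pred_jo - next_joint_obs) ** 2).mean(dim=-1)
        # Intrinsic reward = sum of both heads' prediction errors.
        return indiv_err + joint_err
```

Note that in this sketch the joint observation and action enter the second head only, which matches the quote discussed under Question below.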
The intrinsic reward is then simply the sum of both heads’ prediction errors:
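(The paper's exact formula is not reproduced in these notes; with squared forward-model errors it would read roughly as follows, notation assumed.)

$$
r_i^{\text{int}} = \left\lVert \hat{o}_i^{\,t+1} - o_i^{\,t+1} \right\rVert^2 + \left\lVert \hat{\mathbf{o}}^{\,t+1} - \mathbf{o}^{\,t+1} \right\rVert^2
$$

where $\hat{o}_i^{\,t+1}$ is the first head's prediction of agent $i$'s next observation and $\hat{\mathbf{o}}^{\,t+1}$ is the second head's prediction of the next joint observation.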
Question
When talking about the joint observation and action being given to the MCM's second head, they say:
“Note that this is only passed to the second head because predicting the next observation of the corresponding agent does not need this additional information.”
This does not intuitively seem true: e.g. an agent B coming into the field of view of another agent A will disturb the prediction of A's next observation, and B's movements are not necessarily derivable from A's current observation. -> non-stationarity of MARL from the perspective of any single agent
Experiments
Setting
MPE cooperative navigation environment with two scenarios: 1) same landmark (all agents must go to the same landmark while avoiding each other) and 2) different landmarks (each agent has an assigned landmark and must navigate to it while avoiding the others).
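For reference, the stock MPE cooperative-navigation task is available in PettingZoo as simple_spread; the same-landmark and different-landmarks scenarios look like custom variants of it, so the snippet below only shows how the standard environment is instantiated and stepped (PettingZoo parallel API; the scenario-specific reward logic is not included):

```python
# Minimal loop on PettingZoo's stock cooperative-navigation task (simple_spread).
# The paper's same-landmark / different-landmarks variants would require a
# modified scenario/reward, which is not shown here.
from pettingzoo.mpe import simple_spread_v3

env = simple_spread_v3.parallel_env(N=3, max_cycles=25, continuous_actions=False)
observations, infos = env.reset(seed=0)

while env.agents:
    # Random actions as a placeholder for COMA(+IR) action selection.
    actions = {agent: env.action_space(agent).sample() for agent in env.agents}
    observations, rewards, terminations, truncations, infos = env.step(actions)
env.close()
```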
Note that in practice they did not use the full ICM architecture and discarded the inverse model and state encoders. They argue that using them slows training, and that the full architecture is only useful when the inputs are high-dimensional and can be reduced efficiently by the state encoders. Since in the tested environments the state representation is already compact, this was not needed.
Baselines (a sketch of each variant's intrinsic-reward computation follows the list):
- vanilla COMA
- COMA+ICM-Indiv: individual prediction error for each agent
- COMA+ICM-Joint: joint prediction error shared by all agents
- COMA+ICM-Min: minimum individual error shared by all agents
- COMA+MCM: proposed method
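A plain sketch (not the authors' code) of how the intrinsic-reward variants above could be computed from per-agent prediction errors:

```python
import numpy as np

def intrinsic_rewards(indiv_errors, joint_errors, variant):
    """Per-agent intrinsic reward under each baseline (illustrative only).

    indiv_errors: (n_agents,) each agent's own forward-model prediction error.
    joint_errors: (n_agents,) each agent's error on predicting the joint
                  observation (identical across agents when a single shared
                  joint ICM is used).
    """
    n = len(indiv_errors)
    if variant == "ICM-Indiv":   # each agent keeps its own error
        return indiv_errors
    if variant == "ICM-Joint":   # one joint error shared by all agents
        return np.full(n, joint_errors[0])
    if variant == "ICM-Min":     # minimum individual error shared by all agents
        return np.full(n, indiv_errors.min())
    if variant == "MCM":         # proposed: sum of both heads' errors, per agent
        return indiv_errors + joint_errors
    raise ValueError(f"unknown variant: {variant}")
```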
The results show that vanilla COMA is not competitive in any scenario with the methods that use an enhanced reward function. The proposed method performs well in all environments, whereas the other intrinsic-reward methods tend to do well in only one of the two scenarios (same landmark or different landmarks).
Ablations:
- COMA+MCM-Indiv: each agent only uses the individual prediction error from its MCM
- COMA+MCM-Joint: each agent only uses the joint prediction error from its MCM
- COMA+MCM-Sep: each agent uses the sum of the prediction errors of two separate ICMs, one individual and one joint
The ablation studies confirm the importance of using a single MCM rather than two separate ICMs for the individual and joint prediction errors (COMA+MCM-Sep). They also show the importance of having both an individual and a team component, as the proposed method outperforms COMA+MCM-Indiv and COMA+MCM-Joint.