This paper explores the use of intrinsic reward (IR) to improve exploration in MARL. The authors focus on cooperative, partially observable environments and apply their method to IQL+DQN+HER.
Observation
Most IR methods converge to an estimate that is supposed to scale with , but the estimate usually becomes rough for large input spaces.
They empirically observe that using IR accelerates learning, but because these IRs are unreliable (due to the rough approximations being made), the agents are prevented from finding the optimal solution.
They use a local uncertainty measure introduced by (O’Donoghue et al., 2018): with a representation of the state (e.g. the hidden state of the last layer of a network estimating Q-values). (O’Donoghue et al., 2018) propose as the intrinsic reward.
In short, this is the variance of a linear regression mapping a fixed set of state-action pairs ( used for continuous action spaces) to random labels following a Gaussian distribution. This is reminiscent of Random Network Distillation (cf. NovelD).
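To make the "variance of a linear regression" concrete, here is a minimal sketch of the standard ridge-regression predictive variance, phi^T (ridge*I + sum_i phi_i phi_i^T)^-1 phi, up to the noise scale. The class name, the `ridge` term and the incremental Sherman-Morrison update are assumptions made for illustration, not the authors' implementation; in a discrete-action setting one such estimator would be kept per action.

```python
class LocalUncertainty:
    """Hypothetical helper: tracks inv(ridge*I + sum_i phi_i phi_i^T)."""

    def __init__(self, dim, ridge=1.0):
        # Start from the inverse of ridge * identity (assumed regulariser).
        self.inv = [[1.0 / ridge if i == j else 0.0 for j in range(dim)]
                    for i in range(dim)]

    def _matvec(self, phi):
        return [sum(r * p for r, p in zip(row, phi)) for row in self.inv]

    def variance(self, phi):
        # phi^T * inv * phi: large for features unlike those seen so far.
        v = self._matvec(phi)
        return sum(p * x for p, x in zip(phi, v))

    def observe(self, phi):
        # Sherman-Morrison rank-one update of the inverse after adding
        # phi phi^T to the (symmetric) matrix being tracked.
        v = self._matvec(phi)
        denom = 1.0 + sum(p * x for p, x in zip(phi, v))
        n = len(phi)
        for i in range(n):
            for j in range(n):
                self.inv[i][j] -= v[i] * v[j] / denom


est = LocalUncertainty(dim=2)
for _ in range(3):
    est.observe([1.0, 0.0])      # visit "state 0" three times
seen = est.variance([1.0, 0.0])   # shrinks with visits
unseen = est.variance([0.0, 1.0]) # stays at its initial value
```

With one-hot features and `ridge=1.0`, the variance after n visits is 1/(1+n), which is why this measure behaves like a count-based bonus in the tabular case.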
The architecture used is similar to COMA’s, with a centralized critic which learns to maximize a joint value function . Instead of maximizing it w.r.t. , they propose to approximate a local maximum (denoted lmax) by choosing the which maximizes each individual agent’s and using it for the next iteration’s . As in IQL, agents share .
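One possible reading of the lmax approximation is a coordinate-wise search: each agent in turn picks the action that maximizes the central value while the other agents' actions are held fixed, and the result feeds the next iteration. The sketch below follows that reading on a small tabular example; the function name, the dictionary-based `central_q` and the stopping rule are hypothetical and not taken from the paper.

```python
def local_max(central_q, init_action, n_actions, max_sweeps=10):
    """Coordinate-wise search for a local maximum of a joint Q-table.

    central_q: dict mapping joint-action tuples to values (assumed format).
    init_action: starting joint action, e.g. the agents' greedy actions.
    """
    joint = list(init_action)
    for _ in range(max_sweeps):
        changed = False
        for agent in range(len(joint)):
            def value(a, agent=agent):
                cand = joint.copy()
                cand[agent] = a
                return central_q[tuple(cand)]
            best = max(range(n_actions), key=value)
            # Only move if this strictly improves the joint value.
            if value(best) > central_q[tuple(joint)]:
                joint[agent] = best
                changed = True
        if not changed:
            break
    return tuple(joint), central_q[tuple(joint)]


# Two agents, two actions each. The global maximum is at (1, 1),
# but (0, 0) is a local maximum that single-agent changes cannot leave.
q = {(0, 0): 3.0, (0, 1): 0.0, (1, 0): 0.0, (1, 1): 5.0}
stuck = local_max(q, (0, 0), n_actions=2)
found = local_max(q, (0, 1), n_actions=2)
```

The search costs a number of evaluations linear in the number of agents per sweep instead of enumerating every joint action, at the price of possibly returning a local rather than global maximum, as `stuck` illustrates.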
To accelerate the change of the sampling policy in response to newly discovered states, they use a version of Q(λ) (Watkins, 1989) in which they do not cut the traces after exploratory steps, improving reward transport but introducing non-stationary targets.
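A common way to write λ-style targets whose traces are never cut is a backward recursion over a stored episode, G_t = r_t + γ[(1-λ) max_a Q(s_{t+1}, a) + λ G_{t+1}]. The sketch below assumes this form; the function name and argument layout are hypothetical, and the paper's exact variant may differ.

```python
def uncut_lambda_targets(rewards, next_max_q, gamma=0.99, lam=0.8):
    """Backward recursion for lambda-returns without trace cutting.

    rewards[t]    : reward received at step t
    next_max_q[t] : max_a Q(s_{t+1}, a), with 0.0 after a terminal step
    Exploratory actions are NOT treated specially (no trace cutting).
    """
    targets = [0.0] * len(rewards)
    g = next_max_q[-1]  # bootstrap beyond the last stored step
    for t in reversed(range(len(rewards))):
        g = rewards[t] + gamma * ((1.0 - lam) * next_max_q[t] + lam * g)
        targets[t] = g
    return targets


# A sparse reward at the end of a 3-step episode, with an untrained Q (all 0).
rewards = [0.0, 0.0, 1.0]
zeros = [0.0, 0.0, 0.0]
one_step = uncut_lambda_targets(rewards, zeros, gamma=1.0, lam=0.0)
full_trace = uncut_lambda_targets(rewards, zeros, gamma=1.0, lam=1.0)
```

With `lam=0.0` only the last step sees the reward, while with `lam=1.0` every earlier step does immediately, which is the faster transport mentioned above. Because the targets depend on actions taken by an older exploratory policy, they shift as the policy changes.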
Limits of intrinsic rewards (as defined above) for collaborative MARL
- Counting for all is intractable. They propose to estimate it based on , a heuristic that works with arbitrarily large action spaces.
- In collaborative MARL, all agents should receive the same reward to avoid diverging incentives. However, the uncertainty estimate for each agent depends on its observation, to which other agents could have contributed. They propose to use the largest uncertainty as an IR for all agents.
- The agent’s value function continually changes, so the representation differs at different times t (when using the example given above), and estimates thus become outdated after a while. To counter this, they use an exponentially decaying average (see below).
The final collaborative intrinsic reward they use is: Here represents the hidden state of the last layer of the IQL network, and tackles the third limitation listed above.
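The two ingredients named above, an exponentially decaying average and the maximum over agents, can be combined in a short sketch. Everything here is an assumption made for illustration: the class and function names, the `decay` and `ridge` parameters, decaying the ridge term along with the data, and returning the raw variance without any further transformation.

```python
class DecayedUncertainty:
    """Hypothetical per-agent estimator with exponential forgetting.

    Tracks inv(M) where M <- decay * M + phi phi^T on every observation,
    so old features (from outdated representations) fade out.
    """

    def __init__(self, dim, decay=0.9, ridge=1.0):
        self.decay = decay
        self.inv = [[1.0 / ridge if i == j else 0.0 for j in range(dim)]
                    for i in range(dim)]

    def _matvec(self, phi):
        return [sum(r * p for r, p in zip(row, phi)) for row in self.inv]

    def variance(self, phi):
        v = self._matvec(phi)
        return sum(p * x for p, x in zip(phi, v))

    def observe(self, phi):
        n = len(phi)
        # inv(decay * M) = inv(M) / decay
        for i in range(n):
            for j in range(n):
                self.inv[i][j] /= self.decay
        # Sherman-Morrison update for the added phi phi^T term.
        v = self._matvec(phi)
        denom = 1.0 + sum(p * x for p, x in zip(phi, v))
        for i in range(n):
            for j in range(n):
                self.inv[i][j] -= v[i] * v[j] / denom


def collaborative_reward(estimators, features):
    # Every agent receives the same reward: the largest local uncertainty.
    return max(e.variance(phi) for e, phi in zip(estimators, features))


visited = DecayedUncertainty(dim=1, decay=0.5)
for _ in range(50):
    visited.observe([1.0])
fresh = DecayedUncertainty(dim=1, decay=0.5)
shared = collaborative_reward([visited, fresh], [[1.0], [1.0]])
```

Because of the forgetting, the uncertainty of a repeatedly visited feature settles at a positive floor instead of vanishing, and the shared reward is driven by whichever agent is least certain.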
Finally, they train an intrinsically rewarded central agent, while the decentralized agents (here using IQL) are trained simultaneously on the shared replay buffer. This allows decentralized agents to benefit from the exploration induced by the (potentially unreliable) IR, while keeping their policies exclusively based on environmental rewards. Additionally, the central agent conditions on the true state, which contains information potentially unattainable by the decentralized agents.
Experiments
They use a grid-world predator-prey environment, which contains mountain and valley prey. Mountain prey yield more reward, but are harder for the agents to catch because actions going up the mountain have a chance of failing.
As baselines, they use versions of IQL using either no IR or IR with a magnitude of 1.
We can see below that while IR makes learning faster for IQL with , the detrimental effects caused by its unreliability show, and this version quickly stagnates both during training (left) and test (right). IQL with no IR learns more slowly, but achieves better performance during test. Finally, ICQL seems to learn fast but stagnate quickly during training, like IQL with IR, yet its test performance is similar to that of IQL without IR and is reached much faster. This is because, during training with ICQL, they execute 50% of the environments with the intrinsically rewarded central agent and 50% with the decentralized agents. This eventually makes the training performance stagnate, as the central agent is “holding the decentralized agents back”. During the decentralized test, all environments are run by the decentralized agents, and we can see how ICQL allows them to learn much faster than regular IQL agents.