Reward shaping
With reward shaping, the reward function is modified by adding a shaping reward $F$, in order to help the learning algorithm converge faster: $R'(s, a, s') = R(s, a, s') + F(s, a, s')$
Potential-based reward shaping (PBRS)
PBRS was defined in 1999 by Ng et al. as $F(s, a, s') = \gamma \Phi(s') - \Phi(s)$, where $\Phi(s)$ is a potential function returning the potential of a state $s$. It is proven to:
- not alter the optimal policy of a single agent acting in an MDP
- not alter the set of Nash equilibria for multiple agents in a SG
- allow the potential function to be changed during learning without invalidating the previous two properties
Warning
With a badly defined potential function, agents learning with PBRS can still converge to a worse joint policy than agents learning without it.
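Below is a minimal sketch of PBRS applied to a single transition, assuming a toy grid world where the potential is the negative Manhattan distance to a known goal cell (an illustrative choice, not part of the original text):

```python
import numpy as np

GOAL = np.array([4, 4])  # hypothetical goal cell for illustration

def potential(state):
    # Potential Phi(s): negative Manhattan distance to the goal
    # (illustrative assumption; any state-dependent function works).
    return -np.abs(np.array(state) - GOAL).sum()

def shaping(state, next_state, gamma=0.99):
    # Ng et al. (1999): F(s, a, s') = gamma * Phi(s') - Phi(s)
    return gamma * potential(next_state) - potential(state)

def shaped_reward(reward, state, next_state, gamma=0.99):
    # R'(s, a, s') = R(s, a, s') + F(s, a, s')
    return reward + shaping(state, next_state, gamma)

# A transition that moves one step closer to the goal receives a small
# positive shaping bonus on top of the extrinsic reward.
print(shaped_reward(0.0, state=(0, 0), next_state=(1, 0)))
```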
Difference reward
The difference reward $D_i(z) = G(z) - G(z_{-i})$ aims to quantify each agent's individual contribution to the system performance in a cooperative MAS, with the first term representing the global system utility and the second the counterfactual global utility for a theoretical system without the contribution of agent $i$.
Mannion et al. (2018) extend PBRS' theoretical guarantees to difference rewards.
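A toy sketch of the difference reward for a cooperative team, assuming the global utility is simply the sum of per-agent contributions (an illustrative assumption, not the only possible choice of $G$):

```python
def global_utility(contributions):
    # Hypothetical global utility G(z): the sum of each agent's collected
    # value (illustrative assumption; any team objective works).
    return sum(contributions.values())

def difference_reward(agent_id, contributions):
    # D_i(z) = G(z) - G(z_{-i}): global utility minus the counterfactual
    # utility of the same system without agent i's contribution.
    without_i = {k: v for k, v in contributions.items() if k != agent_id}
    return global_utility(contributions) - global_utility(without_i)

contributions = {"agent_0": 3.0, "agent_1": 0.0, "agent_2": 1.5}
for agent in contributions:
    print(agent, difference_reward(agent, contributions))
# agent_1 contributed nothing, so its difference reward is 0.0
```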
Intrinsic reward
Intrinsic rewards add a bonus to the extrinsic reward, which helps the agent with exploration. There are two main methods of improving exploration with intrinsic rewards:
- count-based methods encourage exploration of state-action pairs which have not been visited often (a count-based sketch follows this list)
- prediction-based methods use prediction uncertainty as a bonus to encourage the agent to visit unknown areas (e.g. Böhmer (2019), Delos Reyes (2022), NovelD)
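A minimal sketch of a count-based bonus, assuming a tabular count of state-action visits and a bonus coefficient `beta` (both illustrative assumptions; large state spaces typically rely on hashed or pseudo-counts instead):

```python
import math
from collections import defaultdict

class CountBasedBonus:
    # Count-based intrinsic reward: r_int = beta / sqrt(N(s, a)),
    # so rarely visited state-action pairs receive a larger bonus.
    def __init__(self, beta=0.1):
        self.beta = beta
        self.counts = defaultdict(int)

    def bonus(self, state, action):
        self.counts[(state, action)] += 1
        return self.beta / math.sqrt(self.counts[(state, action)])

explorer = CountBasedBonus()
extrinsic = 0.0
# First visit to this state-action pair: the full bonus of beta is added.
print(extrinsic + explorer.bonus(state=(0, 0), action=1))
```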