Reward shaping

With reward shaping, the reward function is modified by adding a shaping reward $F$ to the environment reward $R$, in order to help the learning algorithm converge faster:

$$R'(s, a, s') = R(s, a, s') + F(s, a, s')$$
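
A minimal sketch of this idea, assuming a generic shaping function supplied by the designer (the `shaping_fn`, goal state, and constants below are illustrative, not from the source):

```python
# Reward shaping: R'(s, a, s') = R(s, a, s') + F(s, a, s').
def shaped_reward(reward, state, action, next_state, shaping_fn):
    """Return the environment reward plus the shaping term F(s, a, s')."""
    return reward + shaping_fn(state, action, next_state)

# Hypothetical shaping term: small bonus for moving closer to a goal state.
GOAL = 10
f = lambda s, a, s_next: 0.1 * (abs(s - GOAL) - abs(s_next - GOAL))

print(shaped_reward(0.0, state=3, action=+1, next_state=4, shaping_fn=f))  # 0.1
```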

Potential-based reward shaping (PBRS)

Was defined in 1999 by Ng et al. as $F(s, a, s') = \gamma \Phi(s') - \Phi(s)$, where $\Phi(s)$ is a potential function returning the potential of a state $s$. Proven to:

  • not alter the optimal policy of a single agent acting in an MDP
  • not alter the set of Nash equilibria for multiple agents in an SG (stochastic game)
  • allow the potential function to be changed during learning without affecting the previous two properties

Warning

With a badly defined potential function, agents learning with PBRS can still converge to a worse joint policy than agents learning without it.
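
A minimal sketch of PBRS under these definitions; the goal-distance potential and the value of gamma are assumptions for illustration, not from the source:

```python
GAMMA = 0.99

def potential(state):
    # Hypothetical potential: negative distance to a goal state. A badly
    # chosen potential keeps the guarantees above but can still hurt
    # learning in practice (see the warning).
    goal = 10
    return -abs(state - goal)

def pbrs_term(state, next_state, gamma=GAMMA):
    """F(s, a, s') = gamma * Phi(s') - Phi(s) (Ng et al., 1999)."""
    return gamma * potential(next_state) - potential(state)

# The shaping term is positive for transitions that increase the potential.
r_env = 0.0
r_shaped = r_env + pbrs_term(state=3, next_state=4)
print(r_shaped)  # 0.99 * (-6) - (-7) = 1.06
```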

Difference reward

The difference reward aims to quantify each agent's individual contribution to the system performance in a cooperative MAS: $D_i(z) = G(z) - G(z_{-i})$, with the first term representing the global system utility and the second the counterfactual global utility for a theoretical system without the contribution of agent $i$.

Mannion et al. (2018) extend PBRS's theoretical guarantees to difference rewards.
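
A minimal sketch of the difference reward on a toy cooperative task; the coverage utility and the use of a null action as the counterfactual are assumptions for illustration:

```python
def global_utility(joint_action):
    # Hypothetical team utility G(z): number of distinct targets covered.
    return len(set(a for a in joint_action if a is not None))

def difference_reward(joint_action, i):
    """D_i(z) = G(z) - G(z_{-i}): agent i's marginal contribution."""
    counterfactual = list(joint_action)
    counterfactual[i] = None  # remove agent i's contribution
    return global_utility(joint_action) - global_utility(counterfactual)

# Agents 0 and 2 cover the same target, so neither adds marginal value.
z = ["target_A", "target_B", "target_A"]
print([difference_reward(z, i) for i in range(len(z))])  # [0, 1, 0]
```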

Intrinsic reward

Intrinsic rewards add a bonus to the extrinsic reward, which helps the agent with exploration. There are two main methods of improving exploration with intrinsic rewards:

  • count-based methods give a larger exploration bonus to state-action pairs that have not been visited often (see the sketch after this list)
  • prediction-based methods use prediction uncertainty as a bonus to encourage the agent to visit unknown areas (e.g. Böhmer (2019), Delos Reyes (2022), NovelD)
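
A minimal sketch of a count-based bonus; the $1/\sqrt{N}$ form and the beta coefficient are common choices assumed here, not taken from the references above:

```python
from collections import defaultdict
import math

class CountBasedBonus:
    """Intrinsic bonus r_int = beta / sqrt(N(s, a)) from tabular state-action counts."""

    def __init__(self, beta=0.1):
        self.beta = beta
        self.counts = defaultdict(int)

    def bonus(self, state, action):
        self.counts[(state, action)] += 1
        return self.beta / math.sqrt(self.counts[(state, action)])

# The agent learns from r_ext + r_int; rarely visited pairs get a larger bonus.
intrinsic = CountBasedBonus()
r_total = 0.0 + intrinsic.bonus(state=3, action=1)
print(r_total)  # 0.1 on the first visit, shrinking on repeat visits
```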