This work sits at the intersection of multi-objective reinforcement learning (MORL), active learning (AL), and inverse reinforcement learning (IRL).

Abstract

The paper proposes a method that first learns a set of reward functions from expert demonstrations (using Adversarial IRL), and then uses active learning over pairwise preference queries to learn a weight vector that approximates a Pareto-optimal solution in the space of expert reward functions.

Components

  • Proximal Policy Optimization
  • Adversarial Inverse Reinforcement Learning
  • Interactive MORL (a code sketch of this loop follows the list), specifically they:
    • start with a uniform prior over the weight vector $\mathbf{w}$
    • employ a Bradley-Terry model for pairwise trajectory preferences: $p(\tau_i \succ \tau_j \mid \mathbf{w}) = \frac{\exp(\mathbf{w}^\top \mathbf{r}(\tau_i))}{\exp(\mathbf{w}^\top \mathbf{r}(\tau_i)) + \exp(\mathbf{w}^\top \mathbf{r}(\tau_j))}$, where $\mathbf{r}(\tau)$ is the vector return of trajectory $\tau$
    • update the posterior in a Bayesian way: $p(\mathbf{w} \mid \mathcal{D}) \propto p(\mathcal{D} \mid \mathbf{w})\, p(\mathbf{w})$
    • prioritize queries that are expected to remove the largest posterior volume, estimating the expectations over the posterior $p(\mathbf{w} \mid \mathcal{D})$ with Markov Chain Monte Carlo
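
A minimal sketch of this interactive loop, assuming vector returns have already been computed for the candidate trajectories; the random-walk Metropolis sampler and the max-min volume-removal criterion below are implementation assumptions on my part, not code from the paper:

```python
import numpy as np

def preference_likelihood(w, r_i, r_j):
    """Bradley-Terry probability that trajectory i is preferred over trajectory j,
    given weight vector w and vector returns r_i, r_j."""
    return 1.0 / (1.0 + np.exp(-w @ (r_i - r_j)))

def posterior_samples(data, dim, n_samples=2000, step=0.1, seed=0):
    """Rough random-walk Metropolis sampler for p(w | data); weight vectors are kept
    non-negative and unit-norm. `data` is a list of (r_winner, r_loser) return pairs."""
    rng = np.random.default_rng(seed)
    log_lik = lambda w: sum(np.log(preference_likelihood(w, rw, rl) + 1e-12) for rw, rl in data)
    w = np.ones(dim) / np.sqrt(dim)
    samples = []
    for _ in range(n_samples):
        prop = np.abs(w + step * rng.normal(size=dim))  # stay in the non-negative orthant
        prop /= np.linalg.norm(prop)                    # keep weights on the unit sphere
        if np.log(rng.random()) < log_lik(prop) - log_lik(w):
            w = prop
        samples.append(w)
    return np.array(samples)

def pick_query(candidate_pairs, samples):
    """Choose the pair of vector returns whose answer is expected to remove the most
    posterior volume (max over pairs of the min over the two possible answers)."""
    def removed_volume(r_i, r_j):
        p_ij = np.array([preference_likelihood(w, r_i, r_j) for w in samples])
        return min(np.mean(1.0 - p_ij), np.mean(p_ij))  # E[1 - p(answer)] for either answer
    return max(candidate_pairs, key=lambda pair: removed_volume(*pair))
```

Note that the sigmoid in `preference_likelihood` is just the Bradley-Terry expression rewritten: $\exp(a)/(\exp(a)+\exp(b)) = 1/(1+\exp(b-a))$.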

Additionally, to deal with the challenge of reward functions having different scales, they propose to use the policies obtained from AIRL to normalize each reward component.
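
One plausible concrete reading of this normalization (the rollout-based estimate below and the gymnasium-style environment interface are assumptions, not the paper's exact procedure): scale each learned reward by the average return that its own AIRL policy obtains on it, so that all components live on a comparable range.

```python
import numpy as np

def normalization_constants(airl_policies, airl_rewards, env, n_rollouts=50, horizon=75):
    """Estimate a scale for each reward component as the mean absolute return
    collected by the policy that was trained on that component."""
    scales = []
    for policy, reward_fn in zip(airl_policies, airl_rewards):
        returns = []
        for _ in range(n_rollouts):
            obs, _ = env.reset()
            total = 0.0
            for _ in range(horizon):
                action = policy(obs)
                next_obs, _, terminated, truncated, _ = env.step(action)
                total += reward_fn(obs, action)
                obs = next_obs
                if terminated or truncated:
                    break
            returns.append(total)
        scales.append(abs(np.mean(returns)) + 1e-8)  # guard against division by zero
    return np.array(scales)

# Each reward component is then divided by its scale before being weighted by w.
```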

Below is the MORAL algorithm (image from paper):

MORAL’s two steps (image from paper):
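
Since the figures themselves are not reproduced here, below is a rough structural sketch of the two steps; all function names and signatures are placeholders for illustration, not the paper's code.

```python
import numpy as np

def moral(expert_demos, env, primary_reward, train_airl, train_ppo,
          ask_preference, sample_weights, select_query, n_queries=25):
    """Two-step MORAL sketch: (1) recover one reward function per expert via AIRL,
    (2) interactively learn a scalarization weight vector from pairwise preferences."""
    # Step 1: inverse RL -- one learned reward function per set of expert demonstrations.
    airl_rewards = [train_airl(env, demos) for demos in expert_demos]
    reward_components = airl_rewards + [primary_reward]

    # Step 2: interactive multi-objective RL.
    answers = []                                                    # preference data gathered so far
    w = np.ones(len(reward_components)) / len(reward_components)    # start from the prior mean
    for _ in range(n_queries):
        # Train (or continue training) a policy on the scalarized reward and sample trajectories.
        policy, trajectories = train_ppo(env, reward_components, w)
        # Actively pick the most informative trajectory pair and ask the user.
        pair = select_query(trajectories, sample_weights(answers))
        answers.append(ask_preference(*pair))
        # Re-estimate w as the posterior mean given all answers collected so far.
        w = sample_weights(answers).mean(axis=0)
    policy, _ = train_ppo(env, reward_components, w)
    return policy
```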

Experiments

To perform a qualitative study, they assume that a ground-truth reward function exists, which is used both to generate the demonstrations and to answer the agent’s queries. Following related research, they consider environments that have a primary reward function corresponding to a generic, simple task to solve; this primary reward is added as an additional component alongside the AIRL reward functions in the interactive learning step.
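
In symbols (this notation is mine, since the paper's inline formulas did not survive extraction here): with AIRL reward functions $f_{\theta_1}, \dots, f_{\theta_k}$ and the primary reward $R_P$, the vector reward used in the interactive step is

$$\mathbf{r}(s, a) = \big(f_{\theta_1}(s, a), \dots, f_{\theta_k}(s, a), R_P(s, a)\big)$$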

Environments

Emergency

  • Example to show how MORAL can be used to incorporate social norms from a single expert alongside a primary goal
  • Grid world with 6 randomly positioned humans and a fire extinguisher cell which is the agent’s primary goal (each timestep spent in it gives +0.1)
  • The reward for saving people is obtained by running AIRL on synthetic demonstrations from a PPO agent trained to save people, and the final reward vector consists of this AIRL reward together with the primary (extinguisher) reward
  • 25 queries are made to the simulated user, who, given two trajectories, always prefers the one that saves more people and, as a tiebreaker, the one that spends more time on the extinguisher cell (a sketch of this preference oracle follows the list)
  • MORAL successfully learns policies that save all people while still maximizing the primary goal
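
A minimal sketch of that simulated respondent (the dictionary fields summarizing a trajectory are assumptions for illustration):

```python
def emergency_preference(traj_a, traj_b):
    """Return the preferred trajectory summary: saving more people wins,
    ties are broken by more time steps spent on the fire-extinguisher cell."""
    key = lambda t: (t["people_saved"], t["steps_on_extinguisher"])
    return traj_a if key(traj_a) >= key(traj_b) else traj_b

# The first trajectory saves more people, so it wins despite less extinguisher time.
preferred = emergency_preference(
    {"people_saved": 5, "steps_on_extinguisher": 2},
    {"people_saved": 4, "steps_on_extinguisher": 30},
)
print(preferred["people_saved"])  # -> 5
```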

Delivery

  • Example to show the effectiveness of MORAL on a larger environment with multiple norms and goals
  • Larger grid world with multiple objects, and a primary goal of doing deliveries
  • The reward vector combines the primary delivery reward with two AIRL reward functions whose parameters are trained on demonstrations of two conflicting social norms
  • Sampling synthetic preferences so as to match different target ratios between the objectives, they show that MORAL learns a wide variety of policies that accurately match the given preferences, as measured by KL-divergence (a sketch of this metric follows the list)
  • They show that MORAL is robust against adversarial examples and adheres to common implicit safe behaviors, provided that every marginal reward function induces safe behavior
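
How that match might be quantified, as a sketch (this is my reading of the KL-divergence metric; the exact quantities the paper compares are an assumption): normalize the target ratio and the objective values realized by the learned policy into distributions and compute the divergence between them.

```python
import numpy as np

def ratio_kl(target_ratio, achieved_values, eps=1e-8):
    """KL(target || achieved) between the desired ratio of objectives and the
    ratio actually realized by the learned policy (both normalized to sum to 1)."""
    p = np.asarray(target_ratio, dtype=float)
    q = np.asarray(achieved_values, dtype=float)
    p, q = p / p.sum(), q / q.sum()
    return float(np.sum(p * np.log((p + eps) / (q + eps))))

# e.g. a 3:1 target between deliveries and a norm objective vs. what the policy achieved
print(ratio_kl([3, 1], [28, 11]))
```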

Other

They show that MORAL is more suitable than deep reinforcement learning from human preferences (DRLHP) in multi-objective settings that require trading off conflicting objectives learned from expert data. An ablation study shows that their active querying scheme outperforms randomly selected queries, and that MORAL retains some robustness when a small amount of noise is added to the answers to the queries.