This paper introduces Iterated Pareto Referent Optimisation (IPRO), the first algorithm able to probably retrieve a -Pareto front in a finite number of state, while not making assumption on the type of MOMDP or the shape of the utility function1.

Algorithm

Steps:

  1. Bound the search space
  2. Iteratively, until a -Pareto front is found:
    1. select a point in the lower bound
    2. use it as the input to an oracle, which either returns a (weakly) Pareto-dominating solution that strictly dominates or fails
      • if successful, we can remove the points to the bottom left and top right quadrant of the output
      • if unsuccessful, we can remove the points to the top right quadrant of

Experiments

IPRO can obtain a PF for any policy class given that the oracle is well defined. Here they focus on memory-based policies, defined as Mealy machines where is a set of memory states, a deterministic next action function, the memory update function and the initial memory state.

The authors compare IPRO against PCN and GPI-LS, and results show that IPRO is competitive in all envs, which is a good sign given that it makes less assumptions than the 2 baselines. However, the results are not clear w.r.t. sample efficiency of the algorithm.

Footnotes

  1. Note that for example, PCN allows for non-linear utilities, but requires the MOMDP to be deterministic. Conversely, Envelope works for any MOMDP, but is limited to linear utility functions.