This paper introduces Iterated Pareto Referent Optimisation (IPRO), the first algorithm able to probably retrieve a -Pareto front in a finite number of state, while not making assumption on the type of MOMDP or the shape of the utility function1.
Algorithm
Steps:
- Bound the search space
- Iteratively, until a -Pareto front is found:
- select a point in the lower bound
- use it as the input to an oracle, which either returns a (weakly) Pareto-dominating solution that strictly dominates or fails
- if successful, we can remove the points to the bottom left and top right quadrant of the output
- if unsuccessful, we can remove the points to the top right quadrant of
Experiments
IPRO can obtain a PF for any policy class given that the oracle is well defined. Here they focus on memory-based policies, defined as Mealy machines where is a set of memory states, a deterministic next action function, the memory update function and the initial memory state.
The authors compare IPRO against PCN and GPI-LS, and results show that IPRO is competitive in all envs, which is a good sign given that it makes less assumptions than the 2 baselines. However, the results are not clear w.r.t. sample efficiency of the algorithm.