Linear utility functions can only find policies that lie in convex regions of the Pareto optimal set. This is a significant limitation, as we most often do not know the shape of the Pareto optimal set beforehand, and thus limiting ourselves to linear utility functions could mean discarding important policies.
They propose to use the non-linear Chebyshev scalarization function:

$$SQ(s, a) = \max_{o = 1, \dots, m} w_o \cdot \left| \hat{Q}(s, a, o) - z_o^* \right|$$

with a constantly adjusted utopian point $z_o^*$ defined as the best value so far for objective $o$ plus a small constant $\tau$. This is equivalent to a weighted $L_\infty$ norm, and is used for greedy action selection (by minimization, since scalarized values are distances to the utopian point) in the scalarized MORL framework they propose.
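As a rough sketch of the scalarization step (NumPy-based, with illustrative function names; the utopian-point bookkeeping is an assumption based on the "best value so far plus $\tau$" description):

```python
import numpy as np

def chebyshev_scalarize(q_values, weights, utopian):
    """Chebyshev scalarization: weighted L-infinity distance to the utopian point.

    q_values, weights, utopian: 1-D arrays with one entry per objective.
    Smaller is better, so greedy action selection minimizes this value.
    """
    return float(np.max(weights * np.abs(q_values - utopian)))

def update_utopian(best_so_far, q_values, tau=0.1):
    """Track the best value seen so far per objective; the utopian point
    is that running best plus a small constant tau."""
    best_so_far = np.maximum(best_so_far, q_values)
    return best_so_far, best_so_far + tau
```

For example, with weights `[0.5, 0.5]`, Q-values `[1, 2]`, and utopian point `[3, 3]`, the scalarized value is `max(0.5 * 2, 0.5 * 1) = 1.0`.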
They extend Q-learning for the multi-objective setting, with an extended table $\hat{Q}(s, a, o)$ that stores one value per objective. The action selection is done using scal-$\epsilon$-greedy, which is simply $\epsilon$-greedy on scalarized Q-values using the Chebyshev function.
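A minimal tabular sketch of this setup, under my own assumptions about the interface (the environment dimensions, names, and hyperparameters are illustrative, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions, n_objectives = 5, 3, 2
# Extended Q-table: one Q-value per (state, action, objective) triple.
Q = np.zeros((n_states, n_actions, n_objectives))

def scal_eps_greedy(state, weights, utopian, eps=0.1):
    """scal-eps-greedy: eps-greedy on Chebyshev-scalarized Q-values.

    Scalarized values are distances to the utopian point, so the
    greedy choice is the argmin."""
    if rng.random() < eps:
        return int(rng.integers(n_actions))
    sq = np.max(weights * np.abs(Q[state] - utopian), axis=1)  # one value per action
    return int(np.argmin(sq))

def update(state, action, reward_vec, next_state,
           weights, utopian, alpha=0.1, gamma=0.95):
    """Standard Q-learning update applied component-wise per objective;
    the greedy next action is picked via the scalarization."""
    a_next = scal_eps_greedy(next_state, weights, utopian, eps=0.0)
    td = reward_vec + gamma * Q[next_state, a_next] - Q[state, action]
    Q[state, action] += alpha * td
```

Note that only the action selection is scalarized; the stored Q-values themselves remain vector-valued.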
Warning
Some papers warn that using a non-linear utility function breaks the Bellman optimality equation. This could threaten the convergence guarantees for using this Chebyshev scalarization in a Q-learning setting. For example, Lu (2022) mentions this paper, saying that it shows that using the Chebyshev metric allows finding Pareto efficient policies that are in the interior of the convex hull and that dominate LS policies, while in the Experiments section of the said paper they say:
“We also notice that the policies obtained by the linear scalarization method are, as expected, located in convex areas of the Pareto optimal set, while the Chebyshev function learned policies that are situated in both convex and non-convex regions.”
Additionally, Hayes et al. (2022) note that this invalidation of the assumed additive returns in the Bellman equation implies that non-linear utility functions cannot be used with Bellman equation-based methods under the ESR criterion.
Experiments
Environments:
- Deep Sea Treasure
- MO Mountain Car
Baseline:
- linearly scalarized Q-learning
The results indicate that using the Chebyshev scalarization improves results according to a variety of metrics, including the hypervolume metric and the cardinality of the set of solutions.
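For reference, the hypervolume metric measures the region dominated by a solution set relative to a reference point. A minimal sketch for the two-objective maximization case (not the paper's implementation; it assumes both objectives are maximized):

```python
def hypervolume_2d(points, ref):
    """Area dominated by a set of 2-D points, both objectives maximized,
    relative to a reference point `ref`."""
    hv, y_bound = 0.0, ref[1]
    # Sweep from highest x to lowest; each point contributes the rectangle
    # above the best y seen so far. Dominated points contribute nothing.
    for x, y in sorted(points, reverse=True):
        if x > ref[0] and y > y_bound:
            hv += (x - ref[0]) * (y - y_bound)
            y_bound = y
    return hv
```

For the front `[(1, 3), (2, 2), (3, 1)]` with reference point `(0, 0)`, the dominated area is the union of three rectangles, giving a hypervolume of 6.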