This paper proposes a new dominance relation for the ESR criterion, as well as an algorithm that learns the set of policies that are undominated under this relation.

Theory and dominance in distributional MODeM

Because they are based on expected returns, the Pareto dominance criterion and the Pareto front are not suited for the ESR setting.

First order stochastic dominance

A random (return) variable X first-order stochastically dominates Y when F_X(v) ≤ F_Y(v) for every v, with strict inequality for at least one v, where F denotes the cumulative distribution function (CDF). Intuitively, X puts at least as much probability mass on high outcomes everywhere.
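This test is easy to sketch on empirical samples. The following is a univariate, sample-based sketch (function names `ecdf` and `fsd_dominates` are mine, not the paper's):

```python
def ecdf(samples, v):
    """Empirical CDF: fraction of samples with value <= v."""
    return sum(1 for s in samples if s <= v) / len(samples)

def fsd_dominates(x, y):
    """True iff X first-order stochastically dominates Y:
    F_X(v) <= F_Y(v) everywhere, with strict inequality somewhere."""
    grid = sorted(set(x) | set(y))
    weak = all(ecdf(x, v) <= ecdf(y, v) for v in grid)
    strict = any(ecdf(x, v) < ecdf(y, v) for v in grid)
    return weak and strict

print(fsd_dominates([2, 3], [1, 2]))  # True: shifted upward
print(fsd_dominates([1, 3], [2, 2]))  # False: mean-preserving spread
```

Note that a mean-preserving spread ([1, 3] vs. [2, 2]) is incomparable under FSD even though both have the same expectation; this is exactly the information the ESR setting cares about.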

Distributional domination

Distributional undominated set (DUS)

The DUS is the set of all policies that are not distributionally dominated by any other policy.
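Computing the DUS then amounts to pairwise pruning. A univariate, sample-based sketch (the paper works with multivariate return distributions, where the dominance check is more involved; names are mine):

```python
def ecdf(samples, v):
    return sum(1 for s in samples if s <= v) / len(samples)

def fsd_dominates(x, y):
    grid = sorted(set(x) | set(y))
    return (all(ecdf(x, v) <= ecdf(y, v) for v in grid)
            and any(ecdf(x, v) < ecdf(y, v) for v in grid))

def distributional_undominated_set(dists):
    """Keep every return distribution not dominated by another one."""
    return [z for i, z in enumerate(dists)
            if not any(fsd_dominates(w, z)
                       for j, w in enumerate(dists) if j != i)]

# [2, 3] dominates both others, so it is the only survivor
print(distributional_undominated_set([[2, 3], [1, 2], [1, 3]]))
```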

Convex distributional undominated set (CDUS)

The CDUS is the set of all policies whose return distributions are not distributionally dominated by any convex mixture of the other policies' return distributions.

The CDUS contains an optimal policy for every risk-averse decision maker: for each such utility function u, some policy in the CDUS maximises the expected utility E[u(Z)]. The solution sets (convex hull CH, Pareto front PF) relate as follows:

  • CH ⊆ CDUS and CH ⊆ PF
  • CDUS ⊆ DUS and PF ⊆ DUS
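The mixture-based check that distinguishes the CDUS from the DUS can be sketched by scanning mixture weights. This is a univariate simplification with a brute-force weight grid (the paper's setting is multivariate, and only two components are mixed here; all names are mine):

```python
def ecdf(samples, v):
    return sum(1 for s in samples if s <= v) / len(samples)

def mixture_fsd_dominates(p, a, b, z):
    """Does the mixture p*A + (1-p)*B FSD-dominate Z?
    A mixture's CDF is the same convex combination of the CDFs."""
    grid = sorted(set(a) | set(b) | set(z))
    mix = [p * ecdf(a, v) + (1 - p) * ecdf(b, v) for v in grid]
    fz = [ecdf(z, v) for v in grid]
    return (all(m <= f for m, f in zip(mix, fz))
            and any(m < f for m, f in zip(mix, fz)))

def dominated_by_some_mixture(z, a, b, steps=100):
    return any(mixture_fsd_dominates(k / steps, a, b, z)
               for k in range(steps + 1))

# Neither [1, 4] nor [2, 3] dominates [1, 1, 2, 3, 4] on its own,
# but their 50/50 mixture does: the policy is in the DUS, not the CDUS.
print(dominated_by_some_mixture([1, 1, 2, 3, 4], [1, 4], [2, 3]))
```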

Warning

Some of the proofs are done for bivariate distributions. Do they generalize to more than two objectives? The experiments are also done on environments with 2 objectives; it is unclear whether this restriction is for theoretical or computational reasons.

The DIstributional Multi-Objective Q-learning algorithm

The authors propose an algorithm following the general framework of Pareto Q-learning (PQL) to learn the DUS.

As in PQL, the immediate reward distributions and the expected future reward distributions (ND) are learned separately. To deal with stochastic environments, they propose running random walks before training to estimate the transition probabilities. Action selection and scoring are more complex when dealing with the DUS rather than the PF. Classical metrics (e.g. the hypervolume) can be applied by first computing the expected value of each distribution. They also propose a more efficient scoring method: use a linear utility function as a baseline and score a set of distributions by its mean expected utility; the linear baseline can be substituted with a better approximation once more is known about the decision maker's utility.
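The linear-utility scoring idea can be sketched as follows (sample-based; the function names and the example weights are mine, not the paper's):

```python
def linear_utility(weights):
    """Linear utility baseline u(v) = w . v."""
    return lambda v: sum(w * x for w, x in zip(weights, v))

def expected_utility(samples, u):
    """Estimate E[u(Z)] from samples of the return distribution Z."""
    return sum(u(v) for v in samples) / len(samples)

def score_set(dist_list, u):
    """Score a set of return distributions by its mean expected utility."""
    return sum(expected_utility(s, u) for s in dist_list) / len(dist_list)

u = linear_utility([0.5, 0.5])
# Two bivariate return distributions: uniform over {(1,0),(0,1)}, and a point mass at (2,2)
print(score_set([[(1, 0), (0, 1)], [(2, 2)]], u))  # (0.5 + 2.0) / 2 = 1.25
```

Because u is applied inside the expectation, this score stays meaningful under ESR, unlike metrics computed on expected returns alone.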

An example of why the PF is not enough for ESR

See section 3.1 and appendix C of the paper.
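The flavour of the argument can be reproduced with a toy sketch (illustrative numbers of my own, not the paper's exact example): two policies with identical expected return vectors, hence indistinguishable on the PF, but very different ESR values under a nonlinear utility.

```python
# Nonlinear utility of a bivariate return: u(x, y) = x * y
u = lambda v: v[0] * v[1]

policy_a = [(1, 1)]            # deterministic return (1, 1)
policy_b = [(2, 0), (0, 2)]    # 50/50 between (2, 0) and (0, 2)

def mean(samples):
    n = len(samples)
    return tuple(sum(v[i] for v in samples) / n for i in range(2))

def esr_value(samples):
    """ESR value: expected utility of the return, E[u(Z)]."""
    return sum(u(v) for v in samples) / len(samples)

print(mean(policy_a), mean(policy_b))          # both (1.0, 1.0)
print(esr_value(policy_a), esr_value(policy_b))  # 1.0 vs 0.0
```

Both policies map to the same point in expected-return space, so the PF cannot separate them, yet a decision maker with this utility strictly prefers policy A. This is why ESR needs distributional solution sets.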