Solution sets
The undominated set $U(\Pi)$ is the set of all policies for which there exists a possible utility function $u$ under which the policy's scalarized value function is maximal. A set $CS(\Pi)$ is a coverage set if it is a subset of $U(\Pi)$ and, for every $u$, it contains a policy with maximal scalarized value.
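In symbols, and following the notation of Roijers (2013) (the exact symbols used in the original formulas may differ), these two sets can be written as:

$$U(\Pi) = \{\pi \in \Pi \mid \exists u \in \mathcal{U},\ \forall \pi' \in \Pi : u(\mathbf{V}^{\pi}) \ge u(\mathbf{V}^{\pi'})\}$$

$$CS(\Pi) \subseteq U(\Pi) \quad \text{s.t.} \quad \forall u \in \mathcal{U},\ \exists \pi \in CS(\Pi),\ \forall \pi' \in \Pi : u(\mathbf{V}^{\pi}) \ge u(\mathbf{V}^{\pi'})$$

where $\mathbf{V}^{\pi}$ is the multi-objective value vector of $\pi$ and $\mathcal{U}$ is the space of allowed utility functions.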
Note
- A $CS(\Pi)$ need not be unique. When there are multiple policies whose scalarized value functions are equal, a $CS(\Pi)$ only needs to contain one of them.
- A consequence of the definitions above is that $U(\Pi)$ is itself a $CS(\Pi)$, although ideally we want a MORL algorithm to find a smaller one.
- There may be a policy absent from a $CS(\Pi)$ whose value vector differs from the values of all the policies in the $CS(\Pi)$, as long as there is a policy in the $CS(\Pi)$ whose scalarized value equals its own.
For monotonically increasing utility functions (the general case), the undominated set is the Pareto front (PF), the set of non-dominated policies. The PF has an analogue of coverage sets, in which we only need to keep one of the policies that share the same value vector: Pareto Coverage Sets (PCS). For linear utility functions, the undominated set is the Convex Hull (CH), and coverage sets are called Convex Coverage Sets (CCS).
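As a concrete toy illustration (not taken from the cited papers; the value vectors and the weight grid below are made up), here is a minimal Python sketch that, given a finite set of policy value vectors, extracts a PCS (one policy per non-dominated value vector) and an approximate CCS (policies that are optimal for some linear weight):

```python
import numpy as np

def pareto_coverage_set(values):
    """Keep one value vector per Pareto-non-dominated point (maximization)."""
    kept = []
    for i, v in enumerate(values):
        # v is dominated if some other vector is >= everywhere and > somewhere
        dominated = any(
            np.all(w >= v) and np.any(w > v)
            for j, w in enumerate(values) if j != i
        )
        if not dominated and not any(np.array_equal(v, k) for k in kept):
            kept.append(v)
    return np.array(kept)

def convex_coverage_set(values, n_weights=101):
    """Keep vectors that maximize w . V for some linear weight w (two objectives)."""
    kept = []
    for w1 in np.linspace(0.0, 1.0, n_weights):
        w = np.array([w1, 1.0 - w1])
        best = values[np.argmax(values @ w)]
        if not any(np.array_equal(best, k) for k in kept):
            kept.append(best)
    return np.array(kept)

# Hypothetical value vectors for five deterministic policies on two objectives.
V = np.array([[0.0, 4.0], [1.0, 3.5], [2.0, 1.0], [1.0, 2.0], [4.0, 0.0]])
print(pareto_coverage_set(V))  # keeps [0,4], [1,3.5], [2,1], [4,0]; drops the dominated [1,2]
print(convex_coverage_set(V))  # keeps only [0,4], [1,3.5], [4,0]; [2,1] lies in a concave region
```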
Below is a representation of the undominated set and a corresponding coverage set, unique in this case (Roijers (2013)).
Relationship between the solution sets, from Hayes (2022)
Taxonomy
Let $\Pi_{DS}$ be the subset of $\Pi$ containing the deterministic stationary policies. When $u$ is linear, Roijers (2013) show that any $CCS(\Pi_{DS})$ is also a $CCS(\Pi)$. This does not hold for monotonically increasing $u$, which complicates the solution sets in those cases.
Non-stationary policies
White (1982) shows that in infinite-horizon discounted settings¹ where the utility function is non-linear and only deterministic policies are allowed, non-stationary policies can Pareto dominate stationary ones, and thus the PCS must include non-stationary policies.
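As a toy illustration of the consequence (this is not White's construction, and it shows the coverage-set point rather than Pareto domination itself): in a single-state MOMDP with two actions yielding vector rewards (1, 0) and (0, 1), no stationary deterministic policy is optimal for the non-linear, monotonically increasing utility u(V) = min(V₁, V₂), while an alternating deterministic (non-stationary) policy does much better:

```python
# Toy single-state MOMDP: two actions with vector rewards (1, 0) and (0, 1).
gamma, horizon = 0.9, 1000  # a long horizon approximates the infinite-horizon return

def discounted_return(action_sequence):
    """Vector return of a deterministic (possibly non-stationary) policy."""
    ret = [0.0, 0.0]
    for t, a in enumerate(action_sequence):
        ret[a] += gamma ** t
    return ret

always_a = [0] * horizon                       # stationary: ~ (10, 0),  min = 0
always_b = [1] * horizon                       # stationary: ~ (0, 10),  min = 0
alternating = [t % 2 for t in range(horizon)]  # non-stationary: ~ (5.26, 4.74), min ~ 4.74

for name, seq in [("always A", always_a), ("always B", always_b), ("alternate", alternating)]:
    v = discounted_return(seq)
    print(name, [round(x, 2) for x in v], "u =", round(min(v), 2))
```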
Stochastic policies
Vamplew et al. (2009) show that in an episodic MOMDP, a set of mixture policies can be constructed from policies on a CCS such that this set is a PCS. In particular, we look at the average return vector over many episodes, when a deterministic base policy is stochastically selected at the beginning of every episode. These mixture policies thus provide a continuous range of trade-offs, as opposed to the discrete set of trade-offs offered by deterministic policies. For problems where the PF has concave regions (a common occurrence, e.g. DST below on the left), mixture policies can dominate otherwise non-dominated deterministic policies.
Both diagrams are from Vamplew et al. (2009).
As can be seen above on the right, a mixture policy will be non-dominated if and only if it is formed from base policies which are neighbouring points on the convex hull of the Pareto front. Therefore the set of base policies need only consist of Pareto-optimal policies which are not in concave regions of the Pareto front: any computation expended on finding policies in concavities is wasted, as they will not be used by any non-dominated mixture policy. This suggests that linearly scalarized MORL is the most efficient approach in this case, as it is faster at retrieving policies on the CH.
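A small numeric sketch of that argument, with made-up value vectors loosely shaped like DST's concave front (treasure, negative time): mixing two neighbouring convex-hull policies at episode start yields an expected return vector that Pareto-dominates a deterministic policy sitting in the concavity.

```python
import numpy as np

# Hypothetical value vectors (treasure, -time) for three deterministic policies.
v_fast = np.array([1.0, -1.0])    # small treasure, short episode (CH vertex)
v_deep = np.array([10.0, -19.0])  # big treasure, long episode (CH vertex)
v_mid  = np.array([3.0, -12.0])   # non-dominated, but in a concave region of the PF

# Mixture policy: pick v_fast with prob p, v_deep with prob 1 - p at episode start.
p = 0.6
v_mix = p * v_fast + (1 - p) * v_deep   # expected return vector over many episodes

print(v_mix)                                              # [ 4.6 -8.2]
print(np.all(v_mix >= v_mid) and np.any(v_mix > v_mid))   # True: the mixture dominates v_mid
```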
Wakuta (1999) shows that, in infinite-horizon MOMDPs, a CCS is enough for monotonically increasing scalarization, not by using mixtures of deterministic policies but with stationary randomizations over deterministic stationary policies (a policy is selected at each step, with identical probabilities in all states).
Warning
Just because the CCS is enough to retrieve the entire PF with stochasticity, it does not mean that this is feasible in practice. In particular, Lu (2022) show that only weight vectors normal to a face of the CH can be used to produce optimal stochastic policies, and these weights are shared by all policies on that face, making it impossible to select a precise value function (choosing the right weight would produce an arbitrary policy in the set). Furthermore, this choice of weight is highly subject to numerical instability, and a small error makes the optimum collapse back to a deterministic policy located on a vertex.
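A minimal numeric sketch of this issue (the value vectors are made up): with two neighbouring CH vertices $V^1 = (1, 0)$ and $V^2 = (0, 1)$, the weight normal to the face between them is w = (0.5, 0.5); every stochastic mixture of the two gets exactly the same scalarized value under that w, so the weight cannot pin down a specific trade-off, and any tiny perturbation of w makes a vertex strictly optimal again.

```python
import numpy as np

v1, v2 = np.array([1.0, 0.0]), np.array([0.0, 1.0])  # two neighbouring CH vertices
w_face = np.array([0.5, 0.5])                        # weight normal to the face v1-v2

for p in (0.0, 0.25, 0.5, 1.0):
    v_mix = p * v1 + (1 - p) * v2
    print(p, w_face @ v_mix)   # always 0.5: w_face cannot single out one mixture

# A tiny perturbation of the weight breaks the tie in favour of a vertex.
w_perturbed = np.array([0.5 + 1e-9, 0.5 - 1e-9])
print(w_perturbed @ v1 > w_perturbed @ v2)  # True: selection collapses to the vertex v1
```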
Note
We see that the CCS is not only important for linear utility functions, but that it is also sufficient to construct a PCS in the monotonically increasing case, given that we allow stochastic policies.
Summary
The above two sections can be summarized with the taxonomy table from Roijers (2013) below:
For boxes 1 and 2 (linear $u$), deterministic stationary policies are sufficient to solve the MOMDP, even when stochastic and non-stationary policies are allowed.
Examples of methods for each box of the table follow:
- single-objective RL (since a linear $u$ distributes over addition), e.g. Q-learning on the scalarized reward; see the sketch after this list
- Envelope, PGMORL (they do linear interpolation between the parameters of intra-family policies, so the solutions they find are likely suboptimal. Theoretically, they might be able to retrieve the true PCS, under the conditions defined by Vamplew et al. (2009), by using mixtures, as long as their method retrieves the true CCS.)
- EUPG, where the learned policy is conditioned on the current timestep and accrued rewards
- /
- PQL (the policies found are non-stationary, using a set of non-dominated value vectors per state and action, conditioned on the time step), PCN (can be seen as such, since the RL agent follows a policy trained by supervised learning which conditions on the “desired horizon”)
- CAPQL aims to retrieve policies in box 6 while using linear scalarization, by slightly modifying the reward function to make the objective space more concave.
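As referenced in the first bullet above: because a linear $u$ distributes over the discounted sum, $\mathbf{w} \cdot \sum_t \gamma^t \mathbf{r}_t = \sum_t \gamma^t (\mathbf{w} \cdot \mathbf{r}_t)$, the vector reward can be scalarized at every step and any single-objective algorithm applies unchanged. Below is a minimal tabular sketch, assuming a hypothetical Gymnasium-style environment with discrete states and actions that returns vector rewards (`env` and `w` are placeholders):

```python
import numpy as np

def scalarized_q_learning(env, w, episodes=500, alpha=0.1, gamma=0.99, eps=0.1):
    """Plain tabular Q-learning on the linearly scalarized reward w . r_t."""
    q = np.zeros((env.observation_space.n, env.action_space.n))
    for _ in range(episodes):
        s, _ = env.reset()
        done = False
        while not done:
            # epsilon-greedy action selection on the scalarized Q-values
            a = env.action_space.sample() if np.random.rand() < eps else int(np.argmax(q[s]))
            s_next, r_vec, terminated, truncated, _ = env.step(a)
            r = float(np.dot(w, r_vec))                 # per-step linear scalarization
            target = r + gamma * np.max(q[s_next]) * (not terminated)
            q[s, a] += alpha * (target - q[s, a])
            s, done = s_next, terminated or truncated
    return q  # each fixed weight vector w recovers one CCS policy (boxes 1 and 2)
```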
Footnotes
1. It is mentioned in Roijers (2013) that the arguments White uses hold for finite-horizon and average-reward settings as well.