Context:

  • multi-armed bandits
  • interactive online MORL

The proposed algorithm, Gaussian-process Utility Thompson Sampling (GUTS), simultaneously learns about the environment and the users' (unknown) preferences in multi-objective multi-armed bandits (MOMABs).

It uses:

  • Gaussian Processes to model the unknown user utility function
  • Thompson sampling on the posterior over each arm's expected reward vector (sketched below)
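
As a minimal sketch of the Thompson-sampling half, assume Gaussian rewards with unit observation noise and a standard-normal prior per objective (modelling choices of this sketch, not necessarily the paper's); each arm keeps a conjugate posterior from which mean reward vectors are drawn:

```python
import numpy as np

rng = np.random.default_rng(0)

class GaussianArmPosterior:
    """Independent Gaussian posterior over one arm's mean reward vector.
    Unit observation noise and a N(0, 1) prior per objective are
    assumptions of this sketch, not taken from the paper."""

    def __init__(self, n_objectives):
        self.n = 0                          # number of pulls so far
        self.sums = np.zeros(n_objectives)  # running sum of observed reward vectors

    def update(self, reward_vec):
        self.n += 1
        self.sums += reward_vec

    def sample_mean(self):
        # Conjugate posterior is N(sums / (n + 1), 1 / (n + 1)) per objective.
        var = 1.0 / (self.n + 1)
        return rng.normal(self.sums * var, np.sqrt(var))

# Usage: one Thompson draw of the expected reward vector per arm.
arms = [GaussianArmPosterior(n_objectives=2) for _ in range(5)]
arms[0].update(np.array([0.9, 0.2]))
sampled_means = np.array([arm.sample_mean() for arm in arms])
```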

During the execution of GUTS, a dataset of binary user comparisons is maintained and used to condition the GP's estimate of the utility function.
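
The paper's inference over pairwise comparisons is not reproduced here; as a crude stand-in, the sketch below fits a GP regressor on pseudo-targets (+1 for the preferred reward vector, -1 for the rejected one) and then draws one utility function sample at the candidate points, Thompson-style. The kernel, the noise level `alpha`, and the pseudo-target trick are all assumptions of this sketch, not GUTS's actual preference-likelihood inference.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

comparisons = [  # (winner reward vector, loser reward vector) from user answers
    (np.array([0.9, 0.2]), np.array([0.4, 0.5])),
    (np.array([0.3, 0.8]), np.array([0.6, 0.1])),
]

# Pseudo-targets: +1 for each winner, -1 for each loser (a stand-in for
# proper pairwise-comparison likelihoods).
X = np.vstack([v for pair in comparisons for v in pair])
y = np.array([t for _ in comparisons for t in (+1.0, -1.0)])

gp = GaussianProcessRegressor(kernel=RBF(length_scale=1.0), alpha=1e-2)
gp.fit(X, y)

# Thompson-style step for the utility: draw one function sample at the
# candidate mean reward vectors and rank arms under it.
candidates = np.array([[0.9, 0.2], [0.4, 0.5], [0.3, 0.8]])
u_sample = gp.sample_y(candidates, n_samples=1, random_state=0).ravel()
best_arm = int(np.argmax(u_sample))
```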

To save user time and limit the number of queries, the authors employ several tricks (a runnable sketch follows the list):

  • dismiss the query when the two arms to be compared are not statistically different
  • dismiss the query when one of the two arms to be compared is Pareto dominated by the other
  • dismiss the query when the two sets of samples and utilities agree on the optimal arm
  • impose a cooldown (a minimum number of arm pulls) before the algorithm can query the user again
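
A minimal sketch of a query filter combining the four tricks. The z-score test for "statistically different", the shapes of the posterior samples, and the default cooldown of 10 pulls are assumptions of this sketch, not values from the paper:

```python
import numpy as np

def pareto_dominates(a, b):
    """a Pareto-dominates b: at least as good everywhere, strictly better somewhere."""
    return bool(np.all(a >= b) and np.any(a > b))

def should_query(samples_a, samples_b, u_samples_a, u_samples_b,
                 pulls_since_query, cooldown=10, z=2.0):
    """Decide whether asking the user to compare arms a and b is worthwhile.

    samples_a/b   : posterior samples of each arm's mean reward vector, shape (k, d)
    u_samples_a/b : sampled utilities of those vectors, shape (k,)
    """
    # Trick 4: respect the cooldown between queries.
    if pulls_since_query < cooldown:
        return False
    mu_a, mu_b = samples_a.mean(axis=0), samples_b.mean(axis=0)
    se = np.sqrt(samples_a.var(axis=0) + samples_b.var(axis=0))
    # Trick 1: skip if the arms are not statistically different (here: no
    # objective separated by more than z standard errors -- a heuristic).
    if np.all(np.abs(mu_a - mu_b) <= z * se):
        return False
    # Trick 2: skip if one arm Pareto-dominates the other; the answer is known.
    if pareto_dominates(mu_a, mu_b) or pareto_dominates(mu_b, mu_a):
        return False
    # Trick 3: skip if the sampled utilities already agree on the winner.
    if np.all(u_samples_a > u_samples_b) or np.all(u_samples_b > u_samples_a):
        return False
    return True
```

In a GUTS-style loop, the user would be queried only when `should_query` returns True; otherwise the algorithm keeps pulling arms.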


Experiments

They test GUTS against interactive Thompson sampling (ITS; Roijers et al., 2017) and against a theoretical baseline: single-objective Thompson sampling provided with the ground-truth utility function. Two environments are used: a handcrafted 5-arm, 2-objective environment, and random 5-arm MOMABs with a varying number of objectives. The results indicate that GUTS can effectively learn non-linear utility functions to minimise regret, and that with a low cooldown it approaches the ground-truth TS baseline. It also appears superior to ITS, showing clearly sub-linear regret in the number of pulls where ITS's regret is linear. Note, however, that GUTS's regret curves become increasingly linear as the number of objectives grows.