Context:
- multi-armed bandits
- interactive online MORL
The proposed algorithm, Gaussian-process Utility Thompson Sampling (GUTS), simultaneously learns about the environment and the (unknown) preferences of the user in multi-objective multi-armed bandits (MOMABs).
It uses:
- Gaussian Processes to model the unknown user utility function
- Thompson sampling over posterior distributions of the arms' expected reward vectors
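As a rough illustration of how the two parts combine, here is a minimal sketch of one decision step. It is not the paper's exact procedure: it assumes independent Gaussian posteriors over each arm's mean reward vector, and `sample_utility` is a hypothetical stand-in for a function drawn from the GP posterior over utilities.

```python
import numpy as np

rng = np.random.default_rng(0)

def guts_choose_arm(means, counts, sample_utility, prior_var=1.0):
    """One GUTS-style decision step (sketch, not the paper's exact procedure).

    means:          (n_arms, n_objectives) empirical mean reward vectors
    counts:         (n_arms,) pull counts per arm
    sample_utility: callable (n, d) -> (n,), a function drawn from the
                    GP posterior over utilities (assumed available)
    """
    # Thompson step: sample each arm's mean reward vector from an assumed
    # Gaussian posterior whose variance shrinks with the pull count.
    std = np.sqrt(prior_var / np.maximum(counts, 1))
    sampled_means = means + rng.normal(size=means.shape) * std[:, None]
    # Utility step: score the sampled vectors with the sampled utility.
    utilities = sample_utility(sampled_means)
    return int(np.argmax(utilities)), sampled_means
```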
During the execution of GUTS, a dataset of binary user comparisons is maintained and used to condition the GP posterior over the utility function.
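There are several ways to condition a GP on pairwise comparisons; below is a simplified MAP sketch assuming an RBF kernel and a Bradley-Terry (logistic) likelihood on utility differences. It captures the idea of learning latent utilities from binary comparisons, but `fit_preference_gp` and its hyperparameters are illustrative, not the paper's inference scheme.

```python
import numpy as np

def rbf_kernel(A, B, lengthscale=1.0, var=1.0):
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return var * np.exp(-0.5 * d2 / lengthscale**2)

def fit_preference_gp(X, wins, losses, n_iter=500, lr=0.1, jitter=1e-6):
    """MAP estimate of latent utilities f ~ GP(0, K) from binary comparisons.

    X:      (n, d) all reward vectors that appeared in a user query
    wins:   (m,) indices into X of the preferred vector of each comparison
    losses: (m,) indices into X of the rejected vector
    """
    K = rbf_kernel(X, X) + jitter * np.eye(len(X))
    K_inv = np.linalg.inv(K)
    f = np.zeros(len(X))
    for _ in range(n_iter):
        # Bradley-Terry likelihood: P(win > loss) = sigmoid(f_win - f_loss).
        p = 1.0 / (1.0 + np.exp(-(f[wins] - f[losses])))
        grad = -K_inv @ f                   # gradient of the GP prior term
        np.add.at(grad, wins, 1.0 - p)      # gradient of the likelihood term
        np.add.at(grad, losses, -(1.0 - p))
        f += lr * grad                      # gradient ascent on the log posterior
    return f  # MAP utility values at the queried points
```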
To save user time and limit the number of queries, they employ a number of tricks (see the sketch after this list):
- dismiss the query when the two arms to be compared are not statistically different
- dismiss the query when one of the two arms to be compared is Pareto dominated by the other
- dismiss the query when both sets of sampled rewards and sampled utilities agree on the optimal arm
- add a cooldown (a minimum number of arm pulls) before the algorithm can query the user again
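A minimal sketch of how these gates might be wired together; `should_query` and its parameters are hypothetical, and the statistical-difference test is replaced by a crude `np.allclose` placeholder rather than a proper hypothesis test.

```python
import numpy as np

def pareto_dominates(a, b):
    """True if reward vector a weakly dominates b with one strict gain."""
    return np.all(a >= b) and np.any(a > b)

def should_query(x1, x2, utility_samples, pulls_since_query, cooldown=10):
    """Gate a user comparison of reward vectors x1 and x2 (hypothetical helper)."""
    if pulls_since_query < cooldown:
        return False                      # cooldown since the last query
    if np.allclose(x1, x2):
        return False                      # crude stand-in for a statistical test
    if pareto_dominates(x1, x2) or pareto_dominates(x2, x1):
        return False                      # any monotone utility implies the answer
    prefs = {bool(u(x1) > u(x2)) for u in utility_samples}
    if len(prefs) == 1:
        return False                      # all sampled utilities already agree
    return True
```

The Pareto check in particular never needs user input: any monotonically increasing utility function prefers the dominating vector, so the comparison's outcome is already known.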
Experiments
They test GUTS against interactive Thompson sampling (ITS; Roijers et al., 2017) and against a theoretical baseline: single-objective Thompson sampling provided with the ground-truth utility function. Two environments are used: a handcrafted 5-arm, 2-objective environment, and random 5-arm MOMABs with a varying number of objectives. The results indicate that GUTS can effectively learn non-linear utility functions to minimise regret, and that with a low cooldown it approaches the performance of TS with the ground-truth utility. It also appears superior to ITS, achieving clearly sub-linear regret in the number of pulls compared to ITS's linear regret. Note, however, that the regret curves become increasingly linear as the number of objectives grows.