Preference-based Pure Exploration
Document type :
Conference paper with proceedings
Title :
Preference-based Pure Exploration
Author(s) :
Shukla, Apurv [Author]
Texas A&M University System
Basu, Debabrota [Author]
Centrale Lille
Université de Lille
Inria Lille - Nord Europe
Centre de Recherche en Informatique, Signal et Automatique de Lille - UMR 9189 [CRIStAL]
Scool [Scool]
Conference title :
Advances in Neural Information Processing Systems (NeurIPS)
City :
Vancouver (CA)
Country :
Canada
Start date of the conference :
2024-12
English keyword(s) :
Multi-armed Bandits
Pure exploration
Pareto front algorithm
Vector optimization
Multi-objective optimisation
Sample complexity
Lower bound
Preference Learning
Convex optimisation
HAL domain(s) :
Informatique [cs]/Apprentissage [cs.LG]
Informatique [cs]/Intelligence artificielle [cs.AI]
Statistiques [stat]/Théorie [stat.TH]
Statistiques [stat]/Machine Learning [stat.ML]
English abstract : [en]
We study the preference-based pure exploration problem for bandits with vector-valued rewards ordered using a preference cone $\mathcal{C}$, with the goal of identifying the most preferred policy over the set of arms. First, to quantify the impact of preferences, we derive a novel lower bound on the sample complexity for identifying the most preferred policy with confidence level $1-\delta$. Our lower bound elicits the role played by the geometry of the preference cone and punctuates the difference in hardness compared to best-arm variants of the problem. We further explicate this geometry when rewards follow Gaussian distributions, and provide a convex reformulation of the lower bound. Then, we leverage this convex reformulation to design the Preference-based Track and Stop (PreTS) algorithm that identifies the most preferred policy. Finally, we derive a new concentration result for vector-valued rewards, and show that PreTS achieves a matching sample complexity upper bound.
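The abstract orders vector-valued rewards through a preference cone $\mathcal{C}$. As a minimal illustration of that ordering (not the paper's PreTS algorithm), the sketch below checks cone dominance for a polyhedral cone $C = \{z : Az \ge 0\}$ and computes the resulting set of non-dominated arm means; the matrix `A` and the helper names are assumptions introduced for this example. Taking `A` as the identity makes $C$ the nonnegative orthant, recovering the usual componentwise Pareto order.

```python
import numpy as np

def cone_dominates(x, y, A):
    """True if x - y lies in the cone C = {z : A z >= 0},
    i.e. x is weakly preferred to y under the cone order."""
    return bool(np.all(A @ (np.asarray(x, float) - np.asarray(y, float)) >= 0))

def pareto_front(means, A):
    """Indices of arms whose mean reward vector is not dominated
    by any distinct arm under the cone order induced by A."""
    front = []
    for i, mi in enumerate(means):
        dominated = any(
            cone_dominates(mj, mi, A) and not np.allclose(mj, mi)
            for j, mj in enumerate(means) if j != i
        )
        if not dominated:
            front.append(i)
    return front

# With the orthant cone, (0.2, 0.2) is dominated by (0.5, 0.5):
arms = [(1.0, 0.0), (0.0, 1.0), (0.5, 0.5), (0.2, 0.2)]
print(pareto_front(arms, np.eye(2)))  # → [0, 1, 2]
```

Narrowing the cone (adding rows to `A`) encodes stronger preferences and shrinks the set of non-dominated arms, which is the geometric effect the paper's lower bound quantifies.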
Language :
English
Peer reviewed article :
Yes
Audience :
International
Popular science :
No
Files :
- 2412.02988v1.pdf (Open access)