Preference-based Pure Exploration
Document type:
Conference paper with proceedings
Title:
Preference-based Pure Exploration
Author(s):
Shukla, Apurv [Author]
Texas A&M University System
Basu, Debabrota [Author]
Centrale Lille
Université de Lille
Inria Lille - Nord Europe
Centre de Recherche en Informatique, Signal et Automatique de Lille - UMR 9189 [CRIStAL]
Scool [Scool]
Conference name:
Advances in Neural Information Processing Systems (NeurIPS)
City:
Vancouver (CA)
Country:
Canada
Conference start date:
2024-12
English keyword(s):
Multi-armed Bandits
Pure exploration
Pareto front algorithm
Vector optimization
Multi-objective optimisation
Sample complexity
Lower bound
Preference Learning
Convex optimisation
HAL discipline(s):
Computer Science [cs]/Machine Learning [cs.LG]
Computer Science [cs]/Artificial Intelligence [cs.AI]
Statistics [stat]/Statistics Theory [stat.TH]
Statistics [stat]/Machine Learning [stat.ML]
English abstract: [en]
We study the preference-based pure exploration problem for bandits with vector-valued rewards ordered using a preference cone $\mathcal{C}$, with the goal of identifying the most preferred policy over the set of arms. First, to quantify the impact of preferences, we derive a novel lower bound on the sample complexity of identifying the most preferred policy with confidence level $1-\delta$. Our lower bound elicits the role played by the geometry of the preference cone and highlights the difference in hardness compared to best-arm variants of the problem. We further explicate this geometry when the rewards follow a Gaussian distribution, and provide a convex reformulation of the lower bound. Then, we leverage this convex reformulation to design the Preference-based Track and Stop (PreTS) algorithm that identifies the most preferred policy. Finally, we derive a new concentration result for vector-valued rewards, and show that PreTS achieves a matching sample complexity upper bound.
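For intuition, the cone ordering mentioned in the abstract can be sketched numerically: given a finitely generated preference cone $\mathcal{C}$, one reward vector is weakly preferred to another exactly when their difference lies in $\mathcal{C}$. The sketch below (the function name, the generator-matrix representation, and the use of SciPy's nonnegative least squares are illustrative assumptions, not the paper's algorithm) tests cone membership and shows that with the nonnegative orthant as the cone, the ordering reduces to componentwise (Pareto) dominance.

```python
import numpy as np
from scipy.optimize import nnls

def cone_dominates(x, y, generators, tol=1e-9):
    """Return True if x is weakly preferred to y under the ordering
    induced by the finitely generated cone C = cone(generators):
    x >=_C y  iff  x - y lies in C.

    Membership of d = x - y in C is checked by a nonnegative
    least-squares fit: d is in C iff some w >= 0 gives G w = d,
    where the columns of G are the cone's generators.
    (Hypothetical helper for illustration only.)
    """
    G = np.asarray(generators, dtype=float).T  # columns = generators
    d = np.asarray(x, dtype=float) - np.asarray(y, dtype=float)
    _, residual = nnls(G, d)                   # min ||G w - d||, w >= 0
    return bool(residual <= tol)

# With the nonnegative orthant as the cone, >=_C is Pareto dominance.
orthant = np.eye(2)
print(cone_dominates([2.0, 3.0], [1.0, 1.0], orthant))  # True
print(cone_dominates([2.0, 0.0], [1.0, 1.0], orthant))  # False
```

Other cones change which policies are comparable: a narrower cone (fewer directions counted as "improvement") leaves more incomparable pairs, which is exactly the geometric effect the paper's lower bound quantifies.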
Language:
English
Peer-reviewed:
Yes
Audience:
International
Popularization:
No
Files:
- 2412.02988v1.pdf (Open access)