Kullback-Leibler Upper Confidence Bounds for Optimal Sequential Allocation

Cappé, Olivier; Garivier, Aurélien; Maillard, Odalric Ambrym; Munos, Rémi; Stoltz, Gilles

Type de document :

Compte-rendu et recension critique d'ouvrage

DOI :

10.1214/13-AOS1119

Titre :

Kullback-Leibler Upper Confidence Bounds for Optimal Sequential Allocation

Auteur(s) :

Cappé, Olivier [Auteur]
Laboratoire Traitement et Communication de l'Information [LTCI]
Garivier, Aurélien [Auteur]
Institut de Mathématiques de Toulouse UMR5219 [IMT]
Maillard, Odalric Ambrym [Auteur]

Montanuniversität Leoben [MUL]
Munos, Rémi [Auteur]
Sequential Learning [SEQUEL]
Stoltz, Gilles [Auteur]
Computational Learning, Aggregation, Supervised Statistical, Inference, and Classification [CLASSIC]
Groupement de Recherche et d'Etudes en Gestion à HEC [GREGH]

Titre de la revue :

Annals of Statistics

Pagination :

1516-1541

Éditeur :

Institute of Mathematical Statistics

Date de publication :

2013-06

ISSN :

0090-5364

Mot(s)-clé(s) en anglais :

Multi-armed bandit problems
Upper confidence bound
Kullback-Leibler divergence
Sequential testing

Discipline(s) HAL :

Mathématiques [math]/Probabilités [math.PR]
Mathématiques [math]/Statistiques [math.ST]
Statistiques [stat]/Théorie [stat.TH]
Informatique [cs]/Apprentissage [cs.LG]
Statistiques [stat]/Machine Learning [stat.ML]

Résumé en anglais : [en]

We consider optimal sequential allocation in the context of the so-called stochastic multi-armed bandit model. We describe a generic index policy, in the sense of Gittins (1979), based on upper confidence bounds of the arm ...
Lire la suite >We consider optimal sequential allocation in the context of the so-called stochastic multi-armed bandit model. We describe a generic index policy, in the sense of Gittins (1979), based on upper confidence bounds of the arm payoffs computed using the Kullback-Leibler divergence. We consider two classes of distributions for which instances of this general idea are analyzed: The kl-UCB algorithm is designed for one-parameter exponential families and the empirical KL-UCB algorithm for bounded and finitely supported distributions. Our main contribution is a unified finite-time analysis of the regret of these algorithms that asymptotically matches the lower bounds of Lai and Robbins (1985) and Burnetas and Katehakis (1996), respectively. We also investigate the behavior of these algorithms when used with general bounded rewards, showing in particular that they provide significant improvements over the state-of-the-art.Lire moins >

Langue :

Anglais

Vulgarisation :

Non

Projet Européen :

Pattern Analysis, Statistical Modelling and Computational Learning 2

Collections :