Thompson Sampling pour l’exploration-exploitation dans les systèmes linéaires

Abeille, Marc

Type de document :

Thèse

Titre :

Thompson Sampling pour l’exploration-exploitation dans les systèmes linéaires

Titre en anglais :

Exploration-Exploitation with Thompson Sampling in Linear Systems

Auteur(s) :

Abeille, Marc [Auteur]
Sequential Learning [SEQUEL]
Centre de Recherche en Informatique, Signal et Automatique de Lille - UMR 9189 [CRIStAL]

Directeur(s) de thèse :

Rémi Munos

Date de soutenance :

2017-12-13

Président du jury :

Shipra Agrawal [Rapporteur]
Csaba Szepesvari [Rapporteur]
Olivier Guéant [Président]
Alessandro Lazaric [Encadrant]
Emmanuel Sérié [Encadrant]

Membre(s) du jury :

Shipra Agrawal [Rapporteur]
Csaba Szepesvari [Rapporteur]
Olivier Guéant [Président]
Alessandro Lazaric [Encadrant]
Emmanuel Sérié [Encadrant]

Organisme de délivrance :

Université de Lille 1

Mot(s)-clé(s) :

apprentissage automatique
processus décisionnel
algorithme
thompson sampling
bandit multi-bras
bandit linéaire
contrôle linéaire quadratique
apprentissage par renforcement

Mot(s)-clé(s) en anglais :

machine learning
decision-making
algorithm
multi-armed bandit
linear bandit
linear quadratic control
reinforcement learning

Discipline(s) HAL :

Mathématiques [math]

Résumé :

Cette thèse est dédiée à l’étude du Thompson Sampling (TS), une heuristique qui vise à surmonter le dilemme entre exploration et exploitation qui est inhérent à tout processus décisionnel face à l’incertain. Contrairement ...
Lire la suite >Cette thèse est dédiée à l’étude du Thompson Sampling (TS), une heuristique qui vise à surmonter le dilemme entre exploration et exploitation qui est inhérent à tout processus décisionnel face à l’incertain. Contrairement aux algorithmes issus de l’heuristique optimiste face à l’incertain (OFU), où l’exploration provient du choix du modèle le plus favorable possible au vu de la connaissance accumulée, les algorithmes TS introduisent de l’aléa dans le processus décisionnel en sélectionnant aléatoirement un modèle plausible, ce qui les rend bien moins coûteux numériquement. Cette étude se concentre sur les problèmes paramétriques linéaires, qui autorisent les espaces état-action continus (infinis), en particulier les problèmes de Bandits Linéaires (LB) et les problèmes de contrôle Linéaire et Quadratique (LQ). Nous proposons dans cette thèse de nouvelles analyses du regret des algorithmes TS pour chacun de ces deux problèmes. Bien que notre démonstration pour les LB garantisse une borne supérieure identique aux résultats préexistants, la structure de la preuve offre une nouvelle vision du fonctionnement de l’algorithme TS, et nous permet d’étendre cette analyse aux problèmes LQ. Nous démontrons la première borne supérieure pour le regret de l’algorithme TS dans les problèmes LQ, qui garantie dans le cadre fréquentiste un regret au plus d’ordre O(sqrt(T)). Enfin, nous proposons une application des méthodes d’exploration-exploitation pour les problèmes d’optimisation de portefeuille, et discutons dans ce cadre le besoin ou non d’explorer activement.Lire moins >

Résumé en anglais : [en]

This dissertation is dedicated to the study of the Thompson Sampling (TS) algorithms designed to address the exploration-exploitation dilemma that is inherent in sequential decision-making under uncertainty. As opposed to ...
Lire la suite >This dissertation is dedicated to the study of the Thompson Sampling (TS) algorithms designed to address the exploration-exploitation dilemma that is inherent in sequential decision-making under uncertainty. As opposed to algorithms based on the optimism-in-the-face-of-uncertainty (OFU) principle, where the exploration is performed by selecting the most favorable model within the set of plausible one, TS algorithms rely on randomization to enhance the exploration, and thus are much more computationally efficient. We focus on linearly parametrized problems that allow for continuous state-action spaces, namely the Linear Bandit (LB) problems and the Linear Quadratic (LQ) control problems. We derive two novel analyses for the regret of TS algorithms in those settings. While the obtained regret bound for LB is similar to previous results, the proof sheds new light on the functioning of TS, and allows us to extend the analysis to LQ problems. As a result, we prove the first regret bound for TS in LQ, and show that the frequentist regret is of order O(sqrt(T)), we matches the existing guarantee for the regret of OFU algorithms in LQ. Finally, we propose an application of exploration-exploitation techniques to the practical problem of portfolio construction, and discuss the need for active exploration in this setting.Lire moins >

Langue :

Anglais

Collections :