Episodic reinforcement learning in finite MDPs: Minimax lower bounds revisited
Document type:
Conference paper with proceedings
Title:
Episodic reinforcement learning in finite MDPs: Minimax lower bounds revisited
Author(s):
Domingues, Omar [Author]
Scool [Scool]
Ménard, Pierre [Author]
Otto-von-Guericke-Universität Magdeburg = Otto-von-Guericke University [Magdeburg] [OVGU]
Kaufmann, Emilie [Author]
Scool [Scool]
Valko, Michal [Author]
DeepMind [Paris]
Scool [Scool]
Conference title:
Algorithmic Learning Theory
City:
Paris / Virtual
Country:
France
Conference start date:
2021-03-16
English keyword(s):
reinforcement learning
episodic
lower bounds
HAL discipline(s):
Statistics [stat]/Machine Learning [stat.ML]
English abstract:
In this paper, we propose new problem-independent lower bounds on the sample complexity and regret in episodic MDPs, with a particular focus on the non-stationary case in which the transition kernel is allowed to change in each stage of the episode. Our main contribution is a lower bound of Ω((H^3 SA/ε^2) log(1/δ)) on the sample complexity of an (ε, δ)-PAC algorithm for best policy identification in a non-stationary MDP, relying on a construction of "hard MDPs" which is different from the ones previously used in the literature. Using this same class of MDPs, we also provide a rigorous proof of the Ω(√(H^3 SAT)) regret bound for non-stationary MDPs. Finally, we discuss connections to PAC-MDP lower bounds.
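For readability, the two bounds stated in the abstract can be written out in LaTeX under the standard episodic-MDP notation (an assumption here, not spelled out in this record): H is the horizon, S the number of states, A the number of actions, T the total number of time steps, and (ε, δ) the PAC accuracy and confidence parameters. The left-hand symbols (τ for the sample complexity, R_T for the regret) are illustrative choices, not notation taken from the paper.

% Sample-complexity lower bound for (ε, δ)-PAC best-policy identification
% in a non-stationary episodic MDP:
\[
  \tau \;=\; \Omega\!\left( \frac{H^{3} S A}{\varepsilon^{2}} \,\log\frac{1}{\delta} \right)
\]
% Minimax regret lower bound for non-stationary episodic MDPs,
% over T total time steps of interaction:
\[
  R_T \;=\; \Omega\!\left( \sqrt{H^{3} S A T} \right)
\]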
Language:
English
Peer reviewed:
Yes
Audience:
International
Popular science:
No
Collections:
Source:
Files
- https://hal.inria.fr/hal-03289004/document
- Open access
- Access the document
- domingues2021episodic.pdf
- Open access
- Access the document