Episodic reinforcement learning in finite ...
Document type :
Communication dans un congrès avec actes
Title :
Episodic reinforcement learning in finite MDPs: Minimax lower bounds revisited
Author(s) :
Domingues, Omar [Auteur]
Scool [Scool]
Ménard, Pierre [Auteur]
Otto-von-Guericke-Universität Magdeburg = Otto-von-Guericke University [Magdeburg] [OVGU]
Kaufmann, Emilie [Auteur]
Scool [Scool]
Valko, Michal [Auteur]
DeepMind [Paris]
Scool [Scool]
Ménard, Pierre [Auteur]
Otto-von-Guericke-Universität Magdeburg = Otto-von-Guericke University [Magdeburg] [OVGU]
Kaufmann, Emilie [Auteur]

Scool [Scool]
Valko, Michal [Auteur]

DeepMind [Paris]
Conference title :
Algorithmic Learning Theory
City :
Paris / Virtual
Country :
France
Start date of the conference :
2021-03-16
English keyword(s) :
reinforcement learning
episodic
lower bounds
episodic
lower bounds
HAL domain(s) :
Statistiques [stat]/Machine Learning [stat.ML]
English abstract : [en]
In this paper, we propose new problem-independent lower bounds on the sample complexity and regret in episodic MDPs, with a particular focus on the non-stationary case in which the transition kernel is allowed to change ...
Show more >In this paper, we propose new problem-independent lower bounds on the sample complexity and regret in episodic MDPs, with a particular focus on the non-stationary case in which the transition kernel is allowed to change in each stage of the episode. Our main contribution is a lower bound of Ω((H 3 SA/ε 2) log(1/δ)) on the sample complexity of an (ε, δ)-PAC algorithm for best policy identification in a non-stationary MDP, relying on a construction of "hard MDPs" which is different from the ones previously used in the literature. Using this same class of MDPs, we also provide a rigorous proof of the Ω(√ H 3 SAT) regret bound for non-stationary MDPs. Finally, we discuss connections to PAC-MDP lower bounds.Show less >
Show more >In this paper, we propose new problem-independent lower bounds on the sample complexity and regret in episodic MDPs, with a particular focus on the non-stationary case in which the transition kernel is allowed to change in each stage of the episode. Our main contribution is a lower bound of Ω((H 3 SA/ε 2) log(1/δ)) on the sample complexity of an (ε, δ)-PAC algorithm for best policy identification in a non-stationary MDP, relying on a construction of "hard MDPs" which is different from the ones previously used in the literature. Using this same class of MDPs, we also provide a rigorous proof of the Ω(√ H 3 SAT) regret bound for non-stationary MDPs. Finally, we discuss connections to PAC-MDP lower bounds.Show less >
Language :
Anglais
Peer reviewed article :
Oui
Audience :
Internationale
Popular science :
Non
Collections :
Source :
Files
- https://hal.inria.fr/hal-03289004/document
- Open access
- Access the document
- https://hal.inria.fr/hal-03289004/document
- Open access
- Access the document
- https://hal.inria.fr/hal-03289004/document
- Open access
- Access the document
- document
- Open access
- Access the document
- domingues2021episodic.pdf
- Open access
- Access the document