Minimax PAC bounds on the sample complexity of reinforcement learning with a generative model
Document type :
Journal article
Title :
Minimax PAC bounds on the sample complexity of reinforcement learning with a generative model
Author(s) :
Azar, Mohammad Gheshlaghi [Author]
Department of Medical Physics and Biophysics [Biophysics]
Munos, Rémi [Author]
Sequential Learning [SEQUEL]
Kappen, Hilbert [Author]
Department of Medical Physics and Biophysics [Biophysics]
Journal title :
Machine Learning
Pages :
325-349
Publisher :
Springer Verlag
Publication date :
2013
ISSN :
0885-6125
HAL domain(s) :
Computer Science [cs]/Machine Learning [cs.LG]
English abstract : [en]
We consider the problem of learning the optimal action-value function in discounted-reward Markov decision processes (MDPs). We prove new PAC bounds on the sample complexity of two well-known model-based reinforcement learning (RL) algorithms in the presence of a generative model of the MDP: value iteration and policy iteration. The first result indicates that for an MDP with $N$ state-action pairs and discount factor $\gamma \in [0, 1)$, only $O(N \log(N/\delta)/[(1-\gamma)^3 \epsilon^2])$ state-transition samples are required to find an $\epsilon$-optimal estimate of the action-value function with probability (w.p.) $1-\delta$. Further, we prove that, for small values of $\epsilon$, an order of $O(N \log(N/\delta)/[(1-\gamma)^3 \epsilon^2])$ samples is required to find an $\epsilon$-optimal policy w.p. $1-\delta$. We also prove a matching lower bound of $\Omega(N \log(N/\delta)/[(1-\gamma)^3 \epsilon^2])$ on the sample complexity of estimating the optimal action-value function. To the best of our knowledge, this is the first minimax result on the sample complexity of RL: the upper bound matches the lower bound in terms of $N$, $\epsilon$, $\delta$, and $1/(1-\gamma)$ up to a constant factor. Also, both our lower bound and upper bound improve on the state of the art in terms of their dependence on $1/(1-\gamma)$.
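The abstract centres on model-based RL with a generative model: draw a fixed number of transition samples per state-action pair, build an empirical MDP, and run value iteration (or policy iteration) on it. Below is a minimal Python sketch of that pipeline, assuming a sampler `sample_next_state(s, a)` and a known reward table; the function name, the fixed iteration count, and the per-pair sample size are illustrative assumptions, not the paper's exact algorithm or constants.

```python
import numpy as np

def q_value_iteration_with_generative_model(sample_next_state, reward, n_states,
                                             n_actions, gamma, n_samples_per_pair,
                                             n_iterations):
    """Sketch of model-based Q-value iteration using a generative model.

    sample_next_state(s, a) draws one next state from P(.|s, a); reward is an
    (n_states, n_actions) array of known rewards. Names, the fixed iteration
    count, and the per-pair sample size are illustrative assumptions only.
    """
    # 1) Build the empirical transition model from n samples per state-action pair.
    p_hat = np.zeros((n_states, n_actions, n_states))
    for s in range(n_states):
        for a in range(n_actions):
            for _ in range(n_samples_per_pair):
                p_hat[s, a, sample_next_state(s, a)] += 1.0
    p_hat /= n_samples_per_pair

    # 2) Run value iteration on the empirical MDP to approximate Q*.
    q = np.zeros((n_states, n_actions))
    for _ in range(n_iterations):
        v = q.max(axis=1)                 # state values under the greedy policy
        q = reward + gamma * (p_hat @ v)  # Bellman optimality backup on the empirical model
    return q
```

Up to an unspecified constant factor, the abstract's upper bound suggests choosing `n_samples_per_pair` on the order of $\log(N/\delta)/[(1-\gamma)^3 \epsilon^2]$, with $N$ = `n_states * n_actions`, so that the total budget $N \cdot n$ matches $O(N \log(N/\delta)/[(1-\gamma)^3 \epsilon^2])$ for an $\epsilon$-optimal estimate w.p. $1-\delta$.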
Language :
English
Popular science :
No
Collections :
Source :
Files
- https://hal.archives-ouvertes.fr/hal-00831875/document (Open access)
- https://link.springer.com/content/pdf/10.1007%2Fs10994-013-5368-1.pdf (Open access)
- SampCompRL_MLJ2012.pdf (Open access)