Document type:
Conference paper with proceedings
Title:
Regret Minimization in MDPs with Options without Prior Knowledge
Author(s):
Fruit, Ronan [Author]
Sequential Learning [SEQUEL]
Pirotta, Matteo [Author]
Sequential Learning [SEQUEL]
Lazaric, Alessandro [Author]
Sequential Learning [SEQUEL]
Brunskill, Emma [Author]
Computer Science Department - Carnegie Mellon University
Conference title:
NIPS 2017 - Neural Information Processing Systems
City:
Long Beach
Country:
United States of America
Conference start date:
2017-12-04
Publication date:
2017-12
HAL discipline(s):
Statistics [stat]/Machine Learning [stat.ML]
Abstract in English: [en]
The option framework integrates temporal abstraction into the reinforcement learning model through the introduction of macro-actions (i.e., options). Recent works leveraged the mapping of Markov decision processes (MDPs) with options to semi-MDPs (SMDPs) and introduced SMDP-versions of exploration-exploitation algorithms (e.g., RMAX-SMDP and UCRL-SMDP) to analyze the impact of options on the learning performance. Nonetheless, the PAC-SMDP sample complexity of RMAX-SMDP can hardly be translated into equivalent PAC-MDP theoretical guarantees, while the regret analysis of UCRL-SMDP requires prior knowledge of the distributions of the cumulative reward and duration of each option, which are hardly available in practice. In this paper, we remove this limitation by combining the SMDP view together with the inner Markov structure of options into a novel algorithm whose regret performance matches UCRL-SMDP's up to an additive regret term. We show scenarios where this term is negligible and the advantage of temporal abstraction is preserved. We also report preliminary empirical results supporting the theoretical findings.
Language:
English
Peer reviewed:
Yes
Audience:
International
Popular science:
No
Collections:
Source:
Files
- https://hal.inria.fr/hal-01649082/document
- Open access
- Access the document
- document
- Open access
- Access the document
- supplementary.pdf
- Open access
- Access the document