Active Roll-outs in MDP with Irreversible Dynamics
Document type :
Pré-publication ou Document de travail
Title :
Active Roll-outs in MDP with Irreversible Dynamics
Author(s) :
Maillard, Odalric Ambrym [Auteur]
Centre de Recherche en Informatique, Signal et Automatique de Lille - UMR 9189 [CRIStAL]
Sequential Learning [SEQUEL]
Mann, Timothy [Auteur]
Google DeepMind
Ortner, Ronald [Auteur]
Montanuniversität Leoben [MUL]
Mannor, Shie [Auteur]
McGill University = Université McGill [Montréal, Canada]
Centre de Recherche en Informatique, Signal et Automatique de Lille - UMR 9189 [CRIStAL]
Sequential Learning [SEQUEL]
Mann, Timothy [Auteur]
Google DeepMind
Ortner, Ronald [Auteur]
Montanuniversität Leoben [MUL]
Mannor, Shie [Auteur]
McGill University = Université McGill [Montréal, Canada]
English keyword(s) :
Reinforcement learning
Regret analysis
Multi-chain
MDP
Recoverability c xxxx Odalric-Ambrym
Regret analysis
Multi-chain
MDP
Recoverability c xxxx Odalric-Ambrym
HAL domain(s) :
Informatique [cs]/Intelligence artificielle [cs.AI]
Mathématiques [math]/Statistiques [math.ST]
Mathématiques [math]/Statistiques [math.ST]
English abstract : [en]
In Reinforcement Learning (RL), regret guarantees scaling with the square root of the time horizon have been shown to hold only for communicating Markov decision processes (MDPs) where any two states are connected. This ...
Show more >In Reinforcement Learning (RL), regret guarantees scaling with the square root of the time horizon have been shown to hold only for communicating Markov decision processes (MDPs) where any two states are connected. This essentially means that an algorithm can eventually recover from any mistake. However, real-world tasks usually include situations where taking a single "bad" action can permanently trap a learner in a suboptimal region of the state-space. Since it is provably impossible to achieve sub-linear regret in general multi-chain MDPs, we assume a weak mechanism that allows the learner to request additional information. Our main contribution is to address: (i) how much external information is needed, (ii) how and when to use it, and (iii) how much regret is incurred. We design an algorithm that minimizes requests for external information in the form of rollouts of a policy specified by the learner by actively requesting it only when needed. The algorithm provably achieves O(√ T) active regret after T steps in a large class of multi-chain MDPs, by only requesting O(log(T)) rollout transitions. The superiority of our algorithm to standard algorithms such as R-Max and UCRL is demonstrated in experiments on some illustrative grid-world examples. (a) (b) (c) Figure 1: Example of (a) a communicating MDP, (b) a unichain MDP with a single recurrent class, and (c) a multi-chain MDP with two recurrent classes. The circles represent states while the labeled edges represent transitions due to executing actions {a, b, c}.Show less >
Show more >In Reinforcement Learning (RL), regret guarantees scaling with the square root of the time horizon have been shown to hold only for communicating Markov decision processes (MDPs) where any two states are connected. This essentially means that an algorithm can eventually recover from any mistake. However, real-world tasks usually include situations where taking a single "bad" action can permanently trap a learner in a suboptimal region of the state-space. Since it is provably impossible to achieve sub-linear regret in general multi-chain MDPs, we assume a weak mechanism that allows the learner to request additional information. Our main contribution is to address: (i) how much external information is needed, (ii) how and when to use it, and (iii) how much regret is incurred. We design an algorithm that minimizes requests for external information in the form of rollouts of a policy specified by the learner by actively requesting it only when needed. The algorithm provably achieves O(√ T) active regret after T steps in a large class of multi-chain MDPs, by only requesting O(log(T)) rollout transitions. The superiority of our algorithm to standard algorithms such as R-Max and UCRL is demonstrated in experiments on some illustrative grid-world examples. (a) (b) (c) Figure 1: Example of (a) a communicating MDP, (b) a unichain MDP with a single recurrent class, and (c) a multi-chain MDP with two recurrent classes. The circles represent states while the labeled edges represent transitions due to executing actions {a, b, c}.Show less >
Language :
Anglais
Collections :
Source :
Files
- https://hal.archives-ouvertes.fr/hal-02177808/document
- Open access
- Access the document
- https://hal.archives-ouvertes.fr/hal-02177808/document
- Open access
- Access the document
- https://hal.archives-ouvertes.fr/hal-02177808/document
- Open access
- Access the document
- document
- Open access
- Access the document
- maillard16a.pdf
- Open access
- Access the document