Bridging the Gap Between Imitation Learning and Inverse Reinforcement Learning
Document type:
Book review (compte-rendu et recension critique d'ouvrage)
Title:
Bridging the Gap Between Imitation Learning and Inverse Reinforcement Learning
Author(s):
Piot, Bilal [Author]
DeepMind [London]
Sequential Learning [SEQUEL]
Geist, Matthieu [Author]
CentraleSupélec
Laboratoire Interdisciplinaire des Environnements Continentaux [LIEC]
Pietquin, Olivier [Author]
Sequential Learning [SEQUEL]
DeepMind [London]
Journal title:
IEEE Transactions on Neural Networks and Learning Systems
Pagination:
1814-1826
Publisher:
IEEE
Publication date:
2017-08
ISSN:
2162-237X
Keyword(s) in English:
Learning from Demonstrations
Inverse Reinforcement Learning
Imitation Learning
HAL discipline(s):
Statistics [stat]/Machine Learning [stat.ML]
Abstract in English: [en]
Learning from Demonstrations (LfD) is a paradigm by which an apprentice agent learns a control policy for a dynamic environment by observing demonstrations delivered by an expert agent. It is usually implemented as either Imitation Learning (IL) or Inverse Reinforcement Learning (IRL) in the literature. On the one hand, IRL is a paradigm relying on Markov Decision Processes (MDPs), where the goal of the apprentice agent is to find, from the expert demonstrations, a reward function that could explain the expert behavior. On the other hand, IL consists of directly generalizing the expert strategy, observed in the demonstrations, to unvisited states (it is therefore close to classification when there is a finite set of possible decisions). While these two visions are often considered opposites, the purpose of this paper is to exhibit a formal link between these approaches from which new algorithms can be derived. We show that IL and IRL can be redefined so that they are equivalent, in the sense that there exists an explicit bijective operator (namely the inverse optimal Bellman operator) between their respective spaces of solutions. To do so, we introduce the set-policy framework, which creates a clear link between IL and IRL. As a result, IL and IRL solutions making the best of both worlds are obtained. In addition, it is a unifying framework from which existing IL and IRL algorithms can be derived and which opens the way for IL methods able to deal with the environment's dynamics. Finally, the IRL algorithms derived from the set-policy framework are compared to algorithms belonging to the more common trajectory-matching family. Experiments demonstrate that the set-policy-based algorithms outperform both standard IRL and IL ones and result in more robust solutions.
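As a brief, non-authoritative sketch of the kind of operator the abstract names, under standard MDP notation (transition kernel P, discount factor gamma); the symbols J, Q and R below are chosen for illustration and may not match the paper's exact notation. The optimal Bellman equation ties a reward R to its optimal action-value function:

\[
Q^{*}_{R}(s,a) \;=\; R(s,a) \;+\; \gamma \sum_{s'} P(s' \mid s,a)\, \max_{a'} Q^{*}_{R}(s',a'),
\]

and rearranging it term by term defines an inverse operator J that maps any bounded action-value function Q to the unique reward for which Q is optimal:

\[
(J Q)(s,a) \;=\; Q(s,a) \;-\; \gamma \sum_{s'} P(s' \mid s,a)\, \max_{a'} Q(s',a').
\]

Read this way, an IL-style solution (a score function that ranks the expert's demonstrated actions highest) is carried by J to an IRL-style solution (a reward under which the expert behavior is optimal), which is one way to picture the bijection between solution spaces claimed in the abstract.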
Language:
English
Popular science:
No
Collections:
Source:
Files
- https://hal.archives-ouvertes.fr/hal-01629654/document (open access)
- TNNLS_2016_BPMGOP.pdf (open access)