Document type:
Conference paper (with proceedings)
Title:
Logarithmic regret in communicating MDPs: Leveraging known dynamics with bandits
Author(s):
Saber, Hassan [Author]
Centre de Recherche en Informatique, Signal et Automatique de Lille - UMR 9189 [CRIStAL]
Pesquerel, Fabien [Author]
Centre de Recherche en Informatique, Signal et Automatique de Lille - UMR 9189 [CRIStAL]
Maillard, Odalric Ambrym [Author]
Scool [Scool]
Centre de Recherche en Informatique, Signal et Automatique de Lille - UMR 9189 [CRIStAL]
Talebi, Mohammad [Author]
IT University of Copenhagen [ITU]
Centre de Recherche en Informatique, Signal et Automatique de Lille - UMR 9189 [CRIStAL]
Conference:
Asian Conference on Machine Learning
City:
Istanbul
Country:
Turkey
Conference start date:
2023-11-11
English keyword(s):
Average-reward Markov decision process
regret minimization
logarithmic regret
Markov chain
recurrent classes
HAL discipline(s):
Statistics [stat]
Statistics [stat]/Machine Learning [stat.ML]
Abstract (English):
We study regret minimization in an average-reward and communicating Markov Decision Process (MDP) with known dynamics but unknown reward function. Although learning in such MDPs is a priori easier than in fully unknown ones, it remains largely challenging, since this setting includes as special cases large classes of problems such as combinatorial semi-bandits. Leveraging knowledge of the transition function for regret minimization in a statistically efficient way appears largely unexplored. Since achieving exact optimality in generic MDPs is conjectured to be NP-hard even with known transitions, we focus on a computationally efficient relaxation, at the cost of achieving order-optimal logarithmic regret instead of exact optimality. We contribute to filling this gap by introducing a novel algorithm based on the popular Indexed Minimum Empirical Divergence (IMED) strategy for bandits. A key component of the proposed algorithm is a carefully designed stopping criterion leveraging the recurrent classes induced by stationary policies. We derive a nonasymptotic, problem-dependent, logarithmic regret bound for this algorithm, which relies on a novel regret decomposition leveraging this structure. We further provide an efficient implementation and experiments illustrating its promising empirical performance.
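For readers unfamiliar with the bandit strategy the abstract builds on, the Indexed Minimum Empirical Divergence (IMED) index of Honda and Takemura can be sketched in its simplest Bernoulli multi-armed-bandit form: each arm gets the index N_a · kl(mu_a, mu_max) + log(N_a), and the arm with the smallest index is played. This is only an illustrative sketch of the underlying bandit index, not the paper's MDP algorithm; the function names and the toy two-arm instance below are hypothetical.

```python
import math
import random

def kl_bernoulli(p, q, eps=1e-12):
    # Bernoulli KL divergence kl(p, q), clipped away from 0 and 1 for numerical stability.
    p = min(max(p, eps), 1 - eps)
    q = min(max(q, eps), 1 - eps)
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

def imed_indices(means, counts):
    # IMED index of arm a: N_a * kl(mu_a, mu_max) + log(N_a).
    # IMED plays the arm with the SMALLEST index (empirically best arms score low).
    mu_max = max(means)
    return [n * kl_bernoulli(m, mu_max) + math.log(n)
            for m, n in zip(means, counts)]

def run_imed(true_means, horizon, seed=0):
    # Run IMED on a Bernoulli bandit for `horizon` rounds; return pull counts per arm.
    rng = random.Random(seed)
    k = len(true_means)
    counts, sums = [0] * k, [0.0] * k
    # Pull each arm once to initialise the empirical means.
    for a in range(k):
        sums[a] += float(rng.random() < true_means[a])
        counts[a] += 1
    for _ in range(horizon - k):
        means = [s / n for s, n in zip(sums, counts)]
        idx = imed_indices(means, counts)
        a = min(range(k), key=lambda i: idx[i])
        sums[a] += float(rng.random() < true_means[a])
        counts[a] += 1
    return counts
```

On a two-arm instance such as `run_imed([0.2, 0.8], 2000)`, the suboptimal arm is pulled only on the order of log(T) times, which is the source of the logarithmic regret the paper extends to communicating MDPs with known dynamics.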
Language:
English
Peer-reviewed:
Yes
Audience:
International
Popular science:
No
Files:
- acml_2023_fabien.pdf (open access)