Regret bounds for restless Markov bandits
Document type:
Article in a scientific journal: original article
Author(s):
Ortner, Ronald [Author]
Montanuniversität Leoben [MUL]
Ryabko, Daniil [Author]
Sequential Learning [SEQUEL]
Auer, Peter [Author]
University of Leoben [MU]
Munos, Rémi [Author]
Microsoft Research [Redmond]
Sequential Learning [SEQUEL]
Journal title:
Theoretical Computer Science
Pages:
62-76
Publisher:
Elsevier
Publication date:
2014
ISSN:
0304-3975
HAL discipline(s):
Computer Science [cs]/Artificial Intelligence [cs.AI]
English abstract: [en]
We consider the restless Markov bandit problem, in which the state of each arm evolves according to a Markov process independently of the learner's actions. We suggest an algorithm that first represents the setting as an MDP which exhibits some special structural properties. In order to grasp this information we introduce the notion of $\epsilon$-structured MDPs, which are a generalization of concepts like (approximate) state aggregation and MDP homomorphisms. We propose a general algorithm for learning $\epsilon$-structured MDPs and show regret bounds that demonstrate that additional structural information enhances learning. Applied to the restless bandit setting, this algorithm achieves, after any $T$ steps, regret of order $\tilde{O}(\sqrt{T})$ with respect to the best policy that knows the distributions of all arms. We make no assumptions on the Markov chains underlying each arm except that they are irreducible. In addition, we show that index-based policies are necessarily suboptimal for the considered problem.
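To make the "restless" property concrete, here is a minimal Python sketch of the setting the abstract describes. All parameters (the two 2-state chains, the reward tables, the round-robin policy) are illustrative assumptions of ours, not the paper's $\epsilon$-structured-MDP algorithm: the point is only that every arm's chain transitions at every step, whether or not that arm is pulled.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two arms, each an irreducible 2-state Markov chain (rows sum to 1).
# Hypothetical numbers chosen for illustration only.
transitions = [
    np.array([[0.9, 0.1],
              [0.2, 0.8]]),
    np.array([[0.5, 0.5],
              [0.5, 0.5]]),
]
# Mean reward of each (arm, state) pair.
rewards = [np.array([1.0, 0.0]),
           np.array([0.6, 0.4])]

states = [0, 0]   # current state of each arm (hidden from the learner)
T = 10_000
total = 0.0

for t in range(T):
    arm = t % 2   # placeholder round-robin policy, not the paper's
    total += rewards[arm][states[arm]]
    # All arms transition, pulled or not -- this is what makes the
    # bandit "restless", as opposed to the rested case where only
    # the pulled arm's chain advances.
    for a in range(2):
        states[a] = rng.choice(2, p=transitions[a][states[a]])

print(f"average reward of round-robin over {T} steps: {total / T:.3f}")
```

Regret in this setting is measured against the best policy that knows all transition matrices and reward distributions, which is what makes the $\tilde{O}(\sqrt{T})$ bound nontrivial.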
Language:
English
Peer-reviewed:
Yes
Audience:
International
Popular science:
No
Files
- https://doi.org/10.1016/j.tcs.2014.09.026 (open access)
- http://arxiv.org/pdf/1209.2693 (arXiv: 1209.2693, open access)