
Regret bounds for restless Markov bandits

Document type :
Article in a scientific journal
DOI :
10.1016/j.tcs.2014.09.026
Title :
Regret bounds for restless Markov bandits
Author(s) :
Ortner, Ronald [Author]
Montanuniversität Leoben [MUL]
Ryabko, Daniil [Author]
Sequential Learning [SEQUEL]
Auer, Peter [Author]
University of Leoben [MU]
Munos, Rémi [Author]
Microsoft Research [Redmond]
Sequential Learning [SEQUEL]
Journal title :
Theoretical Computer Science
Pages :
62-76
Publisher :
Elsevier
Publication date :
2014
ISSN :
1879-2294
HAL domain(s) :
Computer Science [cs]/Artificial Intelligence [cs.AI]
English abstract : [en]
We consider the restless Markov bandit problem, in which the state of each arm evolves according to a Markov process independently of the learner's actions. We suggest an algorithm that first represents the setting as an MDP which exhibits some special structural properties. In order to grasp this information we introduce the notion of $\epsilon$-structured MDPs, which generalize concepts like (approximate) state aggregation and MDP homomorphisms. We propose a general algorithm for learning $\epsilon$-structured MDPs and show regret bounds demonstrating that additional structural information enhances learning. Applied to the restless bandit setting, this algorithm achieves, after any $T$ steps, regret of order $\tilde{O}(\sqrt{T})$ with respect to the best policy that knows the distributions of all arms. We make no assumptions on the Markov chains underlying each arm except that they are irreducible. In addition, we show that index-based policies are necessarily suboptimal for the considered problem.
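The setting summarized in the abstract can be illustrated with a minimal simulation: each arm is a Markov chain whose hidden state advances at every time step whether or not the learner pulls it (the "restless" property), and the learner only observes the reward of the arm it pulls. This is an illustrative sketch only, not the paper's algorithm; the names `RestlessArm` and `play` and the two-state chains are assumptions chosen for the example.

```python
import random

class RestlessArm:
    """An arm whose state evolves as a Markov chain at every step,
    independently of the learner's actions (the 'restless' property)."""

    def __init__(self, transition, rewards, state=0, seed=0):
        self.transition = transition  # transition[s] = next-state probabilities
        self.rewards = rewards        # rewards[s] = reward collected in state s
        self.state = state
        self.rng = random.Random(seed)

    def step(self):
        # The chain moves regardless of whether this arm was pulled.
        probs = self.transition[self.state]
        self.state = self.rng.choices(range(len(probs)), weights=probs)[0]

    def reward(self):
        return self.rewards[self.state]


def play(arms, policy, horizon):
    """Run a pulling policy: only the pulled arm's reward is collected,
    but every arm's hidden state advances each step."""
    total = 0.0
    for t in range(horizon):
        total += arms[policy(t)].reward()
        for arm in arms:  # restless: all chains evolve every step
            arm.step()
    return total


# Two-state irreducible chains (irreducibility is the paper's only
# assumption on each arm); the specific numbers are illustrative.
arm0 = RestlessArm([[0.9, 0.1], [0.1, 0.9]], rewards=[1.0, 0.0])
arm1 = RestlessArm([[0.5, 0.5], [0.5, 0.5]], rewards=[0.4, 0.6])

gain = play([arm0, arm1], policy=lambda t: t % 2, horizon=1000)
```

Because the chains keep moving while an arm is idle, the value of pulling an arm depends on how long it has been unobserved, which is why (as the abstract notes) index-based policies can be suboptimal here.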
Language :
English
Peer reviewed article :
Yes
Audience :
International
Popular science :
No
Collections :
  • Centre de Recherche en Informatique, Signal et Automatique de Lille (CRIStAL) - UMR 9189
Source :
Harvested from HAL
Files
  • https://doi.org/10.1016/j.tcs.2014.09.026 (Open access)
  • http://arxiv.org/pdf/1209.2693 (Open access)