Improved Exploration in Factored Average-Reward MDPs

Talebi, Sadegh; Jonsson, Anders; Maillard, Odalric Ambrym

Document type :

Communication dans un congrès avec actes

Title :

Improved Exploration in Factored Average-Reward MDPs

Author(s) :

Talebi, Sadegh [Auteur]
University of Copenhagen = Københavns Universitet [UCPH]
Jonsson, Anders [Auteur]
Universitat Pompeu Fabra [Barcelona] [UPF]
Maillard, Odalric Ambrym [Auteur]

Scool [Scool]
Centre de Recherche en Informatique, Signal et Automatique de Lille - UMR 9189 [CRIStAL]
Inria Lille - Nord Europe

Conference title :

24th International Conference on Artificial Intelligence and Statistics

City :

San diego (virtual)

Country :

Etats-Unis d'Amérique

Start date of the conference :

2021

Book title :

Proceedings of the 24th International Conference on Artificial Intelligence and Statistics

Publication date :

2021

HAL domain(s) :

Informatique [cs]/Intelligence artificielle [cs.AI]
Statistiques [stat]/Machine Learning [stat.ML]

English abstract : [en]

We consider a regret minimization task under the average-reward criterion in an unknown Factored Markov Decision Process (FMDP). More specifically, we consider an FMDP where the stateaction space X and the state-space S ...
Show more >We consider a regret minimization task under the average-reward criterion in an unknown Factored Markov Decision Process (FMDP). More specifically, we consider an FMDP where the stateaction space X and the state-space S admit the respective factored forms of X = ⊗ n i=1 X i and S = ⊗ m i=1 S i , and the transition and reward functions are factored over X and S. Assuming known factorization structure, we introduce a novel regret minimization strategy inspired by the popular UCRL2 strategy, called DBN-UCRL, which relies on Bernstein-type confidence sets defined for individual elements of the transition function. We show that for a generic factorization structure, DBN-UCRL achieves a regret bound, whose leading term strictly improves over existing regret bounds in terms of the dependencies on the size of S i 's and the involved diameter-related terms. We further show that when the factorization structure corresponds to the Cartesian product of some base MDPs, the regret of DBN-UCRL is upper bounded by the sum of regret of the base MDPs. We demonstrate, through numerical experiments on standard environments, that DBN-UCRL enjoys a substantially improved regret empirically over existing algorithms that have frequentist regret guarantees.Show less >

Language :

Anglais

Peer reviewed article :

Oui

Audience :

Internationale

Popular science :

Non

Collections :

Centre de Recherche en Informatique, Signal et Automatique de Lille (CRIStAL) - UMR 9189

Source :

Harvested from HAL

Files

https://hal.archives-ouvertes.fr/hal-03780564/document
Open access
Access the document

https://hal.archives-ouvertes.fr/hal-03780564/document
Open access
Access the document

document
Open access
Access the document

talebi21a.pdf
Open access
Access the document

document
Open access
Access the document

talebi21a.pdf
Open access
Access the document

Improved Exploration in Factored Average-Reward ... BibTeX CSV Excel RIS

Files

Improved Exploration in Factored Average-Reward ...

BibTeX

CSV

Excel

RIS