Improved Exploration in Factored Average-Reward MDPs

Talebi, Sadegh; Jonsson, Anders; Maillard, Odalric Ambrym

Type de document :

Communication dans un congrès avec actes

Titre :

Improved Exploration in Factored Average-Reward MDPs

Auteur(s) :

Talebi, Sadegh [Auteur]
University of Copenhagen = Københavns Universitet [UCPH]
Jonsson, Anders [Auteur]
Universitat Pompeu Fabra [Barcelona] [UPF]
Maillard, Odalric Ambrym [Auteur]

Scool [Scool]
Centre de Recherche en Informatique, Signal et Automatique de Lille - UMR 9189 [CRIStAL]
Inria Lille - Nord Europe

Titre de la manifestation scientifique :

24th International Conference on Artificial Intelligence and Statistics

Ville :

San diego (virtual)

Pays :

Etats-Unis d'Amérique

Date de début de la manifestation scientifique :

2021

Titre de l’ouvrage :

Proceedings of the 24th International Conference on Artificial Intelligence and Statistics

Date de publication :

2021

Discipline(s) HAL :

Informatique [cs]/Intelligence artificielle [cs.AI]
Statistiques [stat]/Machine Learning [stat.ML]

Résumé en anglais : [en]

We consider a regret minimization task under the average-reward criterion in an unknown Factored Markov Decision Process (FMDP). More specifically, we consider an FMDP where the stateaction space X and the state-space S ...
Lire la suite >We consider a regret minimization task under the average-reward criterion in an unknown Factored Markov Decision Process (FMDP). More specifically, we consider an FMDP where the stateaction space X and the state-space S admit the respective factored forms of X = ⊗ n i=1 X i and S = ⊗ m i=1 S i , and the transition and reward functions are factored over X and S. Assuming known factorization structure, we introduce a novel regret minimization strategy inspired by the popular UCRL2 strategy, called DBN-UCRL, which relies on Bernstein-type confidence sets defined for individual elements of the transition function. We show that for a generic factorization structure, DBN-UCRL achieves a regret bound, whose leading term strictly improves over existing regret bounds in terms of the dependencies on the size of S i 's and the involved diameter-related terms. We further show that when the factorization structure corresponds to the Cartesian product of some base MDPs, the regret of DBN-UCRL is upper bounded by the sum of regret of the base MDPs. We demonstrate, through numerical experiments on standard environments, that DBN-UCRL enjoys a substantially improved regret empirically over existing algorithms that have frequentist regret guarantees.Lire moins >

Langue :

Anglais

Comité de lecture :

Oui

Audience :

Internationale

Vulgarisation :

Non

Collections :

Centre de Recherche en Informatique, Signal et Automatique de Lille (CRIStAL) - UMR 9189

Source :

Harvested from HAL

Fichiers

https://hal.archives-ouvertes.fr/hal-03780564/document
Accès libre
Accéder au document

https://hal.archives-ouvertes.fr/hal-03780564/document
Accès libre
Accéder au document

document
Accès libre
Accéder au document

talebi21a.pdf
Accès libre
Accéder au document

document
Accès libre
Accéder au document

talebi21a.pdf
Accès libre
Accéder au document

Improved Exploration in Factored Average-Reward ... BibTeX CSV Excel RIS

Fichiers

Improved Exploration in Factored Average-Reward ...

BibTeX

CSV

Excel

RIS