Trading off rewards and errors in multi-armed bandits
Document type:
Conference paper with proceedings
Title:
Trading off rewards and errors in multi-armed bandits
Author(s):
Erraqabi, Akram [Author]
Université de Montréal [UdeM]
Sequential Learning [SEQUEL]
Lazaric, Alessandro [Author]
Sequential Learning [SEQUEL]
Valko, Michal [Author]
Sequential Learning [SEQUEL]
Brunskill, Emma [Author]
Computer Science Department - Carnegie Mellon University
Liu, Yun-En [Author]
Computer Science Department - Carnegie Mellon University
Conference title:
International Conference on Artificial Intelligence and Statistics
City:
Fort Lauderdale
Country:
United States of America
Conference start date:
2017
HAL discipline(s):
Statistics [stat]/Machine Learning [stat.ML]
English abstract: [en]
In multi-armed bandits, the most common objective is the maximization of the cumulative reward. Alternative settings include active exploration, where a learner tries to gain accurate estimates of the rewards of all arms. While these objectives are contrasting, in many scenarios it is desirable to trade off rewards and errors. For instance, in educational games the designer wants to gather generalizable knowledge about the behavior of the students and teaching strategies (small estimation errors) but, at the same time, the system needs to avoid giving a bad experience to the players, who may leave the system permanently (large reward). In this paper, we formalize this tradeoff and introduce the ForcingBalance algorithm whose performance is provably close to the best possible tradeoff strategy. Finally, we demonstrate on real-world educational data that ForcingBalance returns useful information about the arms without compromising the overall reward.
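
The abstract names the ForcingBalance algorithm but the record itself carries no pseudocode, so the following is only a minimal Python sketch of the reward/error tradeoff it describes: an agent that forces a minimum number of pulls on every arm (to keep estimation errors small) and otherwise mixes the empirical reward with an uncertainty bonus. The sqrt(t) forcing schedule and the mixing weight w are illustrative assumptions, not the authors' method.

    # Illustrative sketch only: a generic forced-exploration heuristic for the
    # reward-vs-error tradeoff, NOT the paper's ForcingBalance algorithm.
    # The forcing schedule (sqrt(t) pulls per arm) and weight w are hypothetical.
    import numpy as np

    rng = np.random.default_rng(0)
    true_means = np.array([0.2, 0.5, 0.8])  # unknown Bernoulli arm means
    K, T, w = len(true_means), 5000, 0.5    # w in [0, 1]: 1 = pure reward, 0 = pure estimation

    counts = np.zeros(K, dtype=int)         # pulls per arm
    sums = np.zeros(K)                      # cumulative reward per arm

    for t in range(1, T + 1):
        under_pulled = np.flatnonzero(counts < np.sqrt(t))
        if under_pulled.size > 0:
            # Forcing step: guarantee every arm is pulled ~sqrt(t) times so
            # that all estimates keep improving (small estimation error).
            arm = int(under_pulled[0])
        else:
            # Balance step: mix exploitation (empirical mean) with a bonus
            # for arms whose estimates are still noisy (large standard error).
            means = sums / counts
            std_err = 1.0 / np.sqrt(counts)
            arm = int(np.argmax(w * means + (1 - w) * std_err))
        reward = rng.binomial(1, true_means[arm])
        counts[arm] += 1
        sums[arm] += reward

    print("pulls per arm:   ", counts)
    print("estimated means: ", np.round(sums / counts, 3))
    print("average reward:  ", sums.sum() / T)

Setting w close to 1 recovers greedy reward maximization, while w close to 0 approaches uniform active exploration; the forcing step is what keeps every arm's estimate improving regardless of w.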
Language:
English
Peer reviewed:
Yes
Audience:
International
Popular science:
No
ANR project:
Collections:
Source:
Files:
- https://hal.inria.fr/hal-01482765/document (open access)
- erraqabi2017trading.pdf (open access)