Small coresets via negative dependence: DPPs, linear statistics, and concentration
Document type :
Conference paper (with proceedings)
Title :
Small coresets via negative dependence: DPPs, linear statistics, and concentration
Author(s) :
Bardenet, Rémi [Author]
Centre de Recherche en Informatique, Signal et Automatique de Lille - UMR 9189 [CRIStAL]
Centre National de la Recherche Scientifique [CNRS]
Ghosh, Subhroshekhar [Author]
National University of Singapore [NUS]
Simon-Onfroy, Hugo [Author]
Université Paris-Saclay
CEA-Saclay [CEA]
Tran, Hoang-Son [Author]
National University of Singapore [NUS]
Conference title :
NeurIPS 2024 - Thirty-eighth Conference on Neural Information Processing Systems
City :
Vancouver, BC, Canada
Country :
Canada
Start date of the conference :
2024
Publication date :
2024
English keyword(s) :
Coresets; Determinantal point processes; Orthogonal polynomials
HAL domain(s) :
Computer Science [cs]/Machine Learning [cs.LG]
Physics [physics]/Data Analysis, Statistics and Probability [physics.data-an]
English abstract : [en]
Determinantal point processes (DPPs) are random configurations of points with tunable negative dependence. Because sampling is tractable, DPPs are natural candidates for subsampling tasks, such as minibatch selection or coreset construction. A \emph{coreset} is a subset of a (large) training set, such that minimizing an empirical loss averaged over the coreset is a controlled replacement for the intractable minimization of the original empirical loss. Typically, the control takes the form of a guarantee that the average loss over the coreset approximates the total loss uniformly across the parameter space. Recent work has provided significant empirical support in favor of using DPPs to build randomized coresets, coupled with interesting theoretical results that are suggestive but leave some key questions unanswered. In particular, the central question of whether the cardinality of a DPP-based coreset is fundamentally smaller than one based on independent sampling remained open. In this paper, we answer this question in the affirmative, demonstrating that \emph{DPPs can provably outperform independently drawn coresets}. In this vein, we contribute a conceptual understanding of coreset loss as a \emph{linear statistic} of the (random) coreset. We leverage this structural observation to connect the coresets problem to a more general problem of concentration phenomena for linear statistics of DPPs, wherein we obtain \emph{effective concentration inequalities that extend well beyond the state of the art}, encompassing general non-projection, even non-symmetric kernels. The latter have been recently shown to be of interest in machine learning beyond coresets, but come with a limited theoretical toolbox, to the extension of which our result contributes. Finally, we are also able to address the coresets problem for vector-valued objective functions, a novelty in the coresets literature.
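Illustrative sketch :
To make the abstract's "linear statistic" viewpoint concrete, the following is a minimal, self-contained Python sketch, not the authors' construction. It samples a projection DPP over a toy dataset with the exact chain-rule sampler of Hough et al., then estimates the full loss by the inverse-inclusion-probability weighted sum over the coreset, which is unbiased for any DPP with kernel K since P(i in coreset) = K_ii. The Gaussian similarity matrix, the choice of top-k eigenvectors, and the helper name sample_projection_dpp are illustrative assumptions, not taken from the paper.

import numpy as np

rng = np.random.default_rng(0)

# Toy dataset and a per-point loss f(x_i) at one fixed parameter value.
N = 500
x = rng.normal(size=(N, 2))
theta = np.array([0.5, -1.0])
loss = (x @ theta) ** 2

# Illustrative projection kernel K = U U^T: U stacks the top-k eigenvectors
# of a Gaussian similarity matrix (an assumption, not the paper's kernel).
k = 40
sq_dists = ((x[:, None, :] - x[None, :, :]) ** 2).sum(-1)
_, eigvecs = np.linalg.eigh(np.exp(-sq_dists / 2.0))
U = eigvecs[:, -k:]  # (N, k), orthonormal columns

def sample_projection_dpp(U, rng):
    # Exact sequential (chain-rule) sampler for the projection DPP with
    # kernel U @ U.T; returns exactly U.shape[1] distinct indices.
    V = U.copy()
    sample = []
    for t in range(U.shape[1], 0, -1):
        p = (V ** 2).sum(axis=1)          # conditionals; sums to rank of V
        i = rng.choice(len(p), p=p / p.sum())
        sample.append(i)
        if t > 1:
            # Restrict the spanned subspace to vectors vanishing at index i.
            j = np.argmax(np.abs(V[i]))
            V = V - np.outer(V[:, j], V[i]) / V[i, j]
            V, _ = np.linalg.qr(np.delete(V, j, axis=1))
    return np.array(sample)

S = sample_projection_dpp(U, rng)

# Coreset loss as a linear statistic: sum_{i in S} f(x_i) / K_ii is an
# unbiased estimator of sum_i f(x_i), because P(i in S) = K_ii for a DPP.
K_diag = (U ** 2).sum(axis=1)
print("full loss:", loss.sum(), " DPP coreset estimate:", (loss[S] / K_diag[S]).sum())

The paper's question is, roughly, how tightly such linear statistics concentrate around their mean, uniformly over the parameter, and whether the negative dependence of the DPP allows smaller coresets than i.i.d. subsampling for the same guarantee.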
Language :
English
Peer reviewed article :
Yes
Audience :
International
Popular science :
No
Comment :
Accepted at NeurIPS 2024 (Spotlight Paper). Authors are listed in alphabetical order.
Files :
- 2411.00611 (open access)