Document type :
Habilitation à diriger des recherches
Title :
Contributions à la calibration d'algorithmes d'apprentissage : Validation-croisée et détection de ruptures
Author(s) :
Celisse, Alain [Author]
Laboratoire Paul Painlevé - UMR 8524 [LPP]
MOdel for Data Analysis and Learning [MODAL]
Thesis director(s) :
Eric Moulines
Defence date :
2018-10-09
Jury president :
Gérard Biau [Examiner]
Gilles Blanchard [Reviewer]
Arnak Dalalyan [Reviewer]
Barath Sriperumbudur [Reviewer]
Catherine Matias [Examiner]
Christophe Biernacki [Examiner]
Sara van de Geer [Reviewer]
Jury member(s) :
Gérard Biau [Examiner]
Gilles Blanchard [Reviewer]
Arnak Dalalyan [Reviewer]
Barath Sriperumbudur [Reviewer]
Catherine Matias [Examiner]
Christophe Biernacki [Examiner]
Sara van de Geer [Reviewer]
Accredited body :
Université de Lille
Keyword(s) :
inégalité de concentration
inégalité oracle
Sélection de modèle
détection de ruptures
Validation-croisée
noyaux reproduisants
segmentation
English keyword(s) :
reproducing kernels
concentration inequality
change-point detection
cross-validation
Model selection
oracle inequality
HAL domain(s) :
Mathematics [math]/Statistics [math.ST]
Statistics [stat]/Theory [stat.TH]
Statistics [stat]/Machine Learning [stat.ML]
English abstract : [en]
The present manuscript mainly focuses on cross-validation procedures (in particular leave-p-out (LpO)), describing their practical aspects as well as new strategies leading to non-asymptotic theoretical guarantees on their statistical performance (concentration inequalities, oracle inequalities). As a main application, cross-validation is also used to address the multiple change-point detection problem in the off-line context. This problem is then tackled in a more general framework by means of reproducing kernels and the model selection paradigm.
Chapter 1 introduces the cross-validation procedures, and Chapter 2 details current strategies for efficiently computing cross-validation estimators. In particular, several of them yield closed-form expressions for the LpO estimator, which considerably reduces the computational cost. Such closed-form expressions have already been derived in density estimation with projection and kernel estimators, and with k-nearest neighbors estimators in the regression and binary classification contexts.
Chapter 3 discusses the statistical properties of the cross-validation estimators (used as risk estimators) in terms of bias, variance, and mean squared error. For instance, it is established that, among cross-validation estimators, the LpO estimator has the lowest variance for a given test set cardinality. The leave-one-out (L1O) estimator is also proved to be asymptotically optimal in terms of mean squared error in density estimation with projection estimators.
Several approaches leading to concentration inequalities of the LpO estimator around its expectation are discussed in Chapter 4. A direct approach relying on the combination of closed-form expressions and the classical concentration inequalities of Bernstein and Talagrand is first presented in the density estimation context. A more general approach is then described, which exploits the link between the LpO estimator and U-statistics. Its main underlying idea is to deduce exponential concentration results for the LpO estimator from moment inequalities. The derivation of the preliminary results also involves the stability of the learning algorithm used.
The important question of model/statistical algorithm selection is addressed in Chapter 5 in the particular case of density estimation. The optimality of the LpO-based model selection procedure is proved under some conditions both for estimation (by means of a non-asymptotic oracle inequality) and for identification (through a model consistency result).
Cross-validation is then used to tackle the multiple change-point detection problem in the off-line setting, where the variance is allowed to vary over time (heteroscedastic setting). Chapter 6 summarizes the conclusions drawn from theoretical as well as empirical results about the behavior of cross-validation procedures. In particular, these conclusions lead us to suggest new model selection procedures relying on cross-validation. At the price of a higher computational cost, these procedures automatically take into account changes arising in the variance, for instance, which improves the statistical performance. The more general question of detecting changes arising in the full distribution of the observations (and not only in the mean) is also addressed by means of reproducing kernels. A new model selection procedure is designed that is based on a penalty derived in the reproducing kernel Hilbert space framework. Its non-asymptotic performance is quantified through an oracle inequality holding with high probability. Numerous aspects of the new procedure are also assessed in an empirical study. For instance, the results illustrate that the chosen kernel clearly influences the final performance.
Finally, the manuscript ends with Chapter 7, which highlights several challenging perspectives that could give rise to important improvements on both the practical and theoretical sides.
Language :
English
Files
- HDR_manuscript.pdf (open access)