Document type:
Other scientific communication (conference without proceedings, poster, seminar...): Conference communication without proceedings
Permanent URL:
Title:
Model-based multivariate discretization for logistic regression
Author(s):
Erhardt, Adrien [Author]
Biernacki, Christophe [Author]
Vandewalle, Vincent [Author]
Evaluation des technologies de santé et des pratiques médicales - ULR 2694 [METRICS]
Heinrich, Philippe [Author]
Laboratoire Paul Painlevé - UMR 8524 [LPP]
Title of the scientific event:
Data Science Summer School
City:
Paris
Country:
France
Event start date:
2017-08-28
Publication date:
2017-08-28
Keyword(s):
Discretization
Quantization
Grouping
Logistic regression
Preprocessing
Credit scoring
Scoring
Credit risk
HAL discipline(s):
Statistics [stat]/Machine Learning [stat.ML]
Abstract:
Credit institutions are interested in the repayment probability of a loan given the applicant's characteristics, in order to assess the creditworthiness of the application. For regulatory and interpretability reasons, logistic regression is still widely used to learn this probability from the data. Although logistic regression naturally handles both quantitative and qualitative data, two pre-processing steps are usually performed: first, continuous features are discretized by assigning factor levels to pre-determined intervals; second, qualitative features that take numerous values are regrouped into variables with fewer factor levels. This communication focuses on the discretization of continuous variables, which is performed for two main reasons: first, it produces a “scorecard” with a direct correspondence from intervals to score “points”; second, it makes it possible to deal with non-linearity of the score with respect to the continuous variables. Many discretization algorithms already exist (see the review by Ramírez-Gallego et al. (2016)). To the best of our knowledge, the few multivariate supervised algorithms are unsatisfactory in our setup, mainly because they are not fully automated, their optimized criterion does not produce suitable discretized features for logistic regression, and their approaches are empirical. By reinterpreting discretized features as latent variables, we are able, through the use of a Stochastic Expectation-Maximization (SEM) algorithm and a Gibbs sampler, to overcome those shortcomings and to find the best discretization scheme w.r.t. the logistic regression loss. The good performance of this approach is illustrated on simulated and real data from Crédit Agricole Consumer Finance.
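As an illustration of the conventional pre-processing step the abstract describes (not the SEM and Gibbs sampler approach itself), the following is a minimal Python sketch of discretizing one continuous feature with pre-determined intervals and reading off a scorecard. The function names, the cut points, and the toy data are hypothetical and chosen only for illustration. It relies on the fact that, for a single discretized feature with one indicator per interval and no intercept, the fitted logistic regression coefficient of each interval is the log-odds of the observed rate in that interval; a small additive smoothing term is an added assumption here to keep the log-odds finite.

```python
import bisect
import math


def discretize(value, cut_points):
    """Return the index of the interval containing `value`.

    With sorted cut points c_0 < ... < c_{m-1}, interval 0 is (-inf, c_0),
    interval j is [c_{j-1}, c_j), and interval m is [c_{m-1}, +inf).
    """
    return bisect.bisect_right(cut_points, value)


def scorecard(values, labels, cut_points, smoothing=0.5):
    """Return one score "point" per interval: the smoothed log-odds of label 1."""
    n_bins = len(cut_points) + 1
    positives = [0] * n_bins
    totals = [0] * n_bins
    for x, y in zip(values, labels):
        j = discretize(x, cut_points)
        positives[j] += y
        totals[j] += 1
    points = []
    for pos, tot in zip(positives, totals):
        # Additive smoothing (an assumption of this sketch) avoids log(0).
        p = (pos + smoothing) / (tot + 2 * smoothing)
        points.append(math.log(p / (1 - p)))
    return points


def predict_proba(value, cut_points, points):
    """Map a raw value to its interval's points, then to a probability."""
    return 1.0 / (1.0 + math.exp(-points[discretize(value, cut_points)]))


# Hypothetical toy data: applicant age and whether the loan was repaid (1) or not (0).
ages = [22, 25, 28, 35, 40, 45, 52, 58, 63, 70]
repaid = [0, 0, 1, 1, 1, 1, 1, 1, 0, 0]
cut_points = [30, 60]  # pre-determined intervals, fixed by hand for this sketch

points = scorecard(ages, repaid, cut_points)
for j, pts in enumerate(points):
    print(f"interval {j}: {pts:+.3f} points")
print(f"P(repaid | age=40) = {predict_proba(40, cut_points, points):.3f}")
```

In this toy data the middle interval receives the highest points while both outer intervals receive lower ones, a non-monotone pattern that a single linear coefficient on the raw age could not represent. Here the cut points are fixed by hand, whereas the communication is concerned with choosing them automatically with respect to the logistic regression loss.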
Language:
English
Audience:
International
Popular science:
No
Institution(s):
CHU Lille
CNRS
Université de Lille
Deposit date:
2020-06-08T14:10:27Z