Document type:
Other scientific communication (conference without proceedings, poster, seminar...): Conference paper with proceedings: Invited talk
Title:
Model-based clustering of categorical data by relaxing conditional independence
Author(s):
Marbac, Matthieu [Author]
Department of Mathematics & Statistics [Hamilton]
Biernacki, Christophe [Author]
Laboratoire Paul Painlevé - UMR 8524 [LPP]
MOdel for Data Analysis and Learning [MODAL]
Vandewalle, Vincent [Author]
Evaluation des technologies de santé et des pratiques médicales - ULR 2694 [METRICS]
MOdel for Data Analysis and Learning [MODAL]
Title of the scientific event:
Classification Society Meeting
Organizer(s) of the scientific event:
McMaster University
City:
Hamilton, Ontario
Country:
Canada
Start date of the scientific event:
2015-06-03
Publication date:
2015-06-05
HAL discipline(s):
Mathematics [math]
Mathematics [math]/Statistics [math.ST]
Abstract in English: [en]
In model-based clustering, each cluster is modelled by a parametrised probability distribution function (pdf). In the multivariate quantitative data setting, many pdfs are available (Gaussian, Student, ...) and can account for correlations between variables within a cluster. In the multivariate qualitative data setting, there is no natural multivariate pdf. Consequently, the variables are usually assumed to be independent given the cluster; this model is also called the latent class model. The latent class model captures the main heterogeneity in the data and often produces good partitions in practice. However, it can suffer from severe bias when variables are correlated within clusters, resulting in a poor partition and often in an overestimation of the number of clusters. In this talk we will present two parsimonious extensions of the latent class model which relax the assumption of conditional independence given the cluster. In these two models, variables are grouped into blocks that are independent given the cluster, each block following a parsimonious and interpretable distribution. The first model assumes that the block distribution in a cluster is a mixture of two extreme distributions, namely the independence distribution and the maximum dependency distribution. The second model assumes that the block distribution in a cluster is a parsimonious multinomial distribution in which the few free parameters correspond to the most likely modality crossings, while the remaining probability mass is spread uniformly over the other modality crossings. In both cases, parameters are estimated by maximum likelihood using the EM algorithm. The difficult problem of searching for the block structure is handled by a specific MCMC algorithm for each model. When the variables are dependent given the class, these models reduce the biases of the latent class model and, in particular, select a more accurate number of clusters.
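As an illustration of the second block distribution described in the abstract, the following is a minimal Python sketch, not the authors' implementation. It assumes a block is described by the number of levels of each of its variables and by a small dictionary of "mode" crossings with their free probabilities. The names `block_pmf`, `n_levels` and `modes` are hypothetical, and the sketch does not enforce any ordering constraint between the mode probabilities and the uniform remainder.

```python
from itertools import product
from math import prod


def block_pmf(crossing, n_levels, modes):
    """Probability of one modality crossing within a block, for one cluster.

    crossing : tuple of level indices, one per variable in the block
    n_levels : number of levels of each variable in the block
    modes    : dict mapping the few most likely crossings to their free
               probabilities (hypothetical parametrisation for this sketch)
    """
    n_crossings = prod(n_levels)
    free_mass = sum(modes.values())
    if not 0.0 <= free_mass <= 1.0 or len(modes) >= n_crossings:
        raise ValueError("modes must leave room for the uniform remainder")
    if crossing in modes:
        return modes[crossing]
    # The remaining mass is spread uniformly over the non-mode crossings.
    return (1.0 - free_mass) / (n_crossings - len(modes))


# A block of two variables with 3 and 2 levels: 6 crossings, 2 of them free.
n_levels = (3, 2)
modes = {(0, 0): 0.5, (2, 1): 0.3}
probs = {
    c: block_pmf(c, n_levels, modes)
    for c in product(*(range(m) for m in n_levels))
}
total = sum(probs.values())  # should be 1 up to floating-point rounding
```

With only two free probabilities, this block is described by far fewer parameters than the five needed by a full multinomial over six crossings, which is the kind of parsimony the abstract refers to.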
Language:
English
Peer reviewed:
No
Audience:
International
Popular science:
No
Collections:
Source:
Files
- slides.pdf
- Open access