Cahier 2010-02

Titre :	Classification de variables qualitatives autour de variables latentes
Résumé :	En classification, on s'intéresse habituellement à classifier les observations et non les variables. Cependant la classification de variables trouve tout son sens en réduction de dimension, pour la sélection de variables ou encore dans certaines applications (analyse sensorielle, biochimie, marketing, etc.). L'idée est alors de chercher des groupes de variables liées c'est-à-dire porteuses de la même information. Une fois que les variables sont organisées en groupes homogènes telles que les variables au sein d'une même classe sont similaires, il est alors possible de sélectionner dans chaque classe une variable ou de résumer chaque classe de variables par une variable synthétique, encore appelée variable latente. Plusieurs approches ont été spécifiquement développées pour la classification de variables quantitatives. Cependant, pour des données qualitatives, peu de méthodes ont été proposées. Dans cet article, nous étendons le critère proposé par Vigneau et Qannari (2003) dans leur méthode CLV (« Clustering around Latent Variables ») pour la classification de variables quantitatives au cas de données qualitatives. La variable latente d'une classe maximise l'homogénéité de la classe, définie comme la somme des rapports de corrélation entre les variables qualitatives de la classe et cette variable latente quantitative. Nous montrons que cette variable latente peut être obtenue par une Analyse des Correspondances Multiples des variables de la classe. Plusieurs algorithmes de classification utilisant le même critère d'homogénéité sont alors définis : algorithme de type nuées dynamiques, classification hiérarchique ascendante et descendante. Enfin ces différentes approches sont utilisées dans une étude de cas réelle concernant la satisfaction de navigants plaisanciers.
Mot(s) clé :	classification de variables qualitatives, rapport de corrélation, algorithme des nuées dynamiques, classification hiérarchique
Title:	Clustering of categorical variables around latent variables
Abstract:	In clustering, the usual aim is to group observations rather than variables. However, clustering of variables is useful for dimension reduction, for variable selection, and in some applications (sensory analysis, biochemistry, marketing, etc.). Clustering of variables is then studied as a way to arrange variables into homogeneous clusters, thereby organizing data into meaningful structures. Once the variables are clustered into groups such that each variable is similar to the other variables in its cluster, a subset of variables can be selected. Several specific methods have been developed for the clustering of numerical variables. For categorical variables, however, far fewer methods have been proposed. In this paper we extend the criterion used by Vigneau and Qannari (2003) in their Clustering around Latent Variables approach for numerical variables to the case of categorical data. The homogeneity criterion of a cluster of categorical variables is defined as the sum of the correlation ratios between the categorical variables and a latent variable, which in this case is a numerical variable. We show that the latent variable maximizing the homogeneity of a cluster can be obtained with Multiple Correspondence Analysis. Several algorithms for the clustering of categorical variables are proposed: an iterative relocation algorithm, and agglomerative and divisive hierarchical clustering. The proposed methodology is illustrated by a real data application on the satisfaction of pleasure craft operators.
Keyword(s):	clustering of categorical variables, correlation ratio, iterative relocation algorithm, hierarchical clustering
Auteur(s) :	Jérome SARACCO (GREThA UMR CNRS 5113), Marie CHAVENT (IMB UMR CNRS 5251), Vanessa KUENTZ (IMB UMR CNRS 5251)
JEL Class.:	C49 ; C69
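
The homogeneity criterion described in the abstract can be illustrated with a small numerical sketch. This is not the authors' implementation: the function names (`correlation_ratio`, `indicator_matrix`, `latent_variable`) and the toy data are hypothetical, and the latent variable is computed here by a singular value decomposition of the centered, column-scaled indicator matrix, which is assumed to play the role of the first component of a Multiple Correspondence Analysis of the cluster.

```python
import numpy as np


def correlation_ratio(categories, y):
    """Squared correlation ratio between a categorical and a numerical variable:
    between-group sum of squares divided by total sum of squares."""
    categories = np.asarray(categories)
    y = np.asarray(y, dtype=float)
    total = np.sum((y - y.mean()) ** 2)
    if total == 0.0:
        return 0.0  # a constant y carries no variance to explain
    between = 0.0
    for level in np.unique(categories):
        group = y[categories == level]
        between += group.size * (group.mean() - y.mean()) ** 2
    return float(between / total)


def indicator_matrix(columns):
    """Stack the one-hot coding of each categorical variable side by side."""
    blocks = []
    for col in columns:
        col = np.asarray(col)
        levels = np.unique(col)
        blocks.append((col[:, None] == levels[None, :]).astype(float))
    return np.hstack(blocks)


def latent_variable(columns):
    """Return (y, homogeneity): a centered, unit-norm numerical variable y that
    maximizes the sum of correlation ratios with the given categorical
    variables, and the value of that sum."""
    G = indicator_matrix(columns)
    Z = G / np.sqrt(G.sum(axis=0))   # scale each category column by its count
    Z = Z - Z.mean(axis=0)           # centering removes the trivial constant solution
    U, s, _ = np.linalg.svd(Z, full_matrices=False)
    return U[:, 0], float(s[0] ** 2)


# Toy cluster of three categorical variables observed on six individuals
# (made-up data, for illustration only).
x1 = ["a", "a", "b", "b", "c", "c"]
x2 = ["u", "u", "v", "v", "v", "v"]
x3 = ["p", "q", "p", "q", "p", "q"]
cluster = [x1, x2, x3]

y, homogeneity = latent_variable(cluster)
check = sum(correlation_ratio(x, y) for x in cluster)
print(homogeneity, check)  # the two values should agree up to rounding
```

In a clustering algorithm built on this criterion, a function such as `latent_variable` would be called once per cluster, and a partition would be scored by summing the homogeneities of its clusters.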
Cahiers du GREThA (2010)