MODIFICATION OF CHF AND BIC COEFFICIENTS FOR EVALUATION OF CLUSTERING WITH MIXED TYPE VARIABLES
Abstract
The current literature concentrates on evaluating clustering when the individual objects are characterized only by quantitative variables. Problems associated with analyzing data characterized by qualitative or mixed type variables have been addressed only to a limited extent, typically by analogy with techniques used to evaluate other models, for example log-linear models.
In this paper I propose new coefficients for evaluating the resulting clusters, based on the principle of variability analysis. Only coefficients for mixed type variables are presented; they combine the sample variance with one of the variability measures for nominal variables. A similar approach can be applied to purely qualitative variables by omitting the part that characterizes the variability of the quantitative variables.
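The building blocks named above can be illustrated with a minimal Python sketch. The Gini coefficient and entropy are the standard variability measures for a nominal variable; the function `cluster_variability` and the way it adds variances and nominal measures together are assumptions made for illustration only, and the exact combination used in the paper may differ.

```python
from collections import Counter
from math import log
from statistics import pvariance


def gini_coefficient(values):
    """Nominal variance: 1 minus the sum of squared relative frequencies."""
    n = len(values)
    return 1.0 - sum((count / n) ** 2 for count in Counter(values).values())


def entropy(values):
    """Entropy of a nominal variable (natural logarithm)."""
    n = len(values)
    return -sum((count / n) * log(count / n) for count in Counter(values).values())


def cluster_variability(quantitative, nominal, nominal_measure=gini_coefficient):
    """Hypothetical combined variability of one cluster.

    quantitative: list of columns with numeric values
    nominal:      list of columns with category labels
    Assumption: the cluster size multiplies the sum of the variances
    (divisor n) and the nominal variability measures.
    """
    columns = quantitative + nominal
    n = len(columns[0])
    spread = sum(pvariance(col) for col in quantitative)
    spread += sum(nominal_measure(col) for col in nominal)
    return n * spread


if __name__ == "__main__":
    ages = [1.0, 2.0, 3.0, 4.0]
    colours = ["a", "a", "b", "b"]
    print(cluster_variability([ages], [colours]))
    print(cluster_variability([ages], [colours], nominal_measure=entropy))
```

Passing `nominal_measure=entropy` switches the nominal part from the Gini coefficient to entropy, while calling the function with an empty list of quantitative columns covers the purely qualitative case.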
I also evaluated selected indices for determining the number of clusters when objects are characterized by mixed type variables. Using real data files from the UCI Machine Learning Repository (http://archive.ics.uci.edu/ml/datasets.html), I compared three newly proposed indices with the well-known BIC criterion, which is implemented in two-step cluster analysis in the IBM SPSS Statistics system. The number of object groups in each data file was known, and I examined whether the optimal number of clusters found by each index agreed with the real number of groups. Across the 15 data files analyzed, the new indices determined the correct number of clusters more often than the BIC criterion. Criteria based on the Gini coefficient were more successful than those based on entropy.
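As an illustration of how such an index can select the number of clusters, the following Python sketch uses the classical Calinski-Harabasz pseudo-F form with a generic variability measure in place of the sums of squares. This is an assumption about the general shape of a CHF-type index, not the formula from the paper; the names `pseudo_f` and `best_number_of_clusters` and the numbers in the example are hypothetical.

```python
def pseudo_f(total, within, n, k):
    """Pseudo-F index for k clusters of n objects.

    total:  variability of the whole data file (one cluster)
    within: within-cluster variability for the k-cluster solution
    Assumed form: ((n - k) * (total - within)) / ((k - 1) * within)
    """
    return ((n - k) * (total - within)) / ((k - 1) * within)


def best_number_of_clusters(total, within_by_k, n):
    """Return the k with the highest pseudo-F value, plus all scores."""
    scores = {k: pseudo_f(total, w, n, k) for k, w in within_by_k.items()}
    return max(scores, key=scores.get), scores


if __name__ == "__main__":
    # Made-up variability values for solutions with 2, 3 and 4 clusters.
    within_by_k = {2: 60.0, 3: 20.0, 4: 18.0}
    k, scores = best_number_of_clusters(total=100.0, within_by_k=within_by_k, n=50)
    print(k, scores)
```

Agreement with the known number of groups can then be checked by comparing the returned `k` with the real number of groups for each data file.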
The CHFG index determined the correct number of clusters in most cases (93.33 %). The second most successful criterion was the CHFH index (73.33 %). The BIC criterion determined the correct number of clusters in 40.0 % of cases and my modification of the BIC criterion (using the Gini coefficient instead of entropy, which i
References
Calinski, T., Harabasz, J.: A dendrite method for cluster analysis, Communications in Statistics, Vol. 3, 1974, 1–27.
Gan, G., Ma, C., Wu, J.: Data Clustering Theory, Algorithms, and Applications. ASA, Philadelphia, 2007.
Halkidi, M., Batistakis, Y., Vazirgiannis, M.: Clustering Algorithms and Validity Measures. SSDBM, Athens, 2001.
Řehák, J., Řeháková, B.: Analýza kategorizovaných dat v sociologii, Academia, Prague, 1986.
Řezanková, H., Húsek, D., Löster, R.: Clustering with Mixed Type Variables and Determination of Cluster Numbers, CNAM and INRIA, Paris, 2010, pp. 1525-1532.
Řezanková, H., Húsek, D., Snášel, V.: Shluková analýza dat, edition, Professional Publishing, Prague, 2009.
Řezanková, H., Húsek, D.: Methods for the determination of the number of clusters in statistical software packages, VŠE KSTP; VŠE KMIE, Prague, 2008, pp. 1-6.
UCI Machine Learning Repository: http://archive.ics.uci.edu/ml/datasets.html
Copyright information
- Authors retain copyright and grant the journal right of first publication with the work simultaneously licensed under a Creative Commons Attribution License (Creative Commons Attribution License 3.0 - CC BY 3.0) that allows others to share the work with an acknowledgement of the work's authorship and initial publication in this journal.