MODIFICATION OF CHF AND BIC COEFFICIENTS FOR EVALUATION OF CLUSTERING WITH MIXED TYPE VARIABLES

Tomas Loster, University of Economics, Prague
Keywords: Cluster Analysis, Evaluation Of Clustering, BIC Criterion, CHF Criterion

Abstract

Current literature on the evaluation of clustering focuses mainly on situations in which individual objects are characterized only by quantitative variables. Problems associated with the analysis of data characterized by qualitative or mixed type variables have been dealt with only to a limited extent, and the existing approaches are based, for example, on an analogy with the techniques applied when evaluating log-linear models.

In this paper I suggest new coefficients for the evaluation of resulting clusters based on the principle of variability analysis. Only coefficients for mixed type variables are presented; they are based on a combination of the sample variance and one of the variability measures for nominal variables. Similar approaches can be applied to purely qualitative variables by omitting the part that characterizes the variability of quantitative variables.
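The abstract does not state the formulas of the proposed coefficients, so the following is only a minimal sketch of the general idea, assuming a Calinski-Harabasz style pseudo-F ratio in which the variability of each group is the group size times the sum of the variances of the quantitative variables and a nominal variability measure (Gini coefficient or entropy) of the nominal variables. The function names, the divisor-n variance, and the exact weighting are assumptions made for illustration, not the author's definitions.

```python
from collections import Counter
from math import log


def gini(values):
    """Gini coefficient of a nominal variable: 1 - sum of squared relative frequencies."""
    n = len(values)
    return 1.0 - sum((c / n) ** 2 for c in Counter(values).values())


def entropy(values):
    """Entropy of a nominal variable (natural logarithm)."""
    n = len(values)
    return -sum((c / n) * log(c / n) for c in Counter(values).values())


def variance(values):
    """Variance with divisor n (an assumption; the paper may use another divisor)."""
    n = len(values)
    mean = sum(values) / n
    return sum((v - mean) ** 2 for v in values) / n


def mixed_variability(rows, quant_idx, nominal_idx, nominal_measure=gini):
    """Group size times the summed variability of all variables in one group of rows."""
    total = sum(variance([r[j] for r in rows]) for j in quant_idx)
    total += sum(nominal_measure([r[j] for r in rows]) for j in nominal_idx)
    return len(rows) * total


def chf_like_index(rows, labels, quant_idx, nominal_idx, nominal_measure=gini):
    """Pseudo-F ratio of between-cluster to within-cluster variability (hypothetical form).

    Requires at least two clusters and more objects than clusters.
    """
    clusters = {}
    for row, label in zip(rows, labels):
        clusters.setdefault(label, []).append(row)
    n, k = len(rows), len(clusters)
    total = mixed_variability(rows, quant_idx, nominal_idx, nominal_measure)
    within = sum(
        mixed_variability(c, quant_idx, nominal_idx, nominal_measure)
        for c in clusters.values()
    )
    if within == 0:
        return float("inf")  # perfectly homogeneous clusters
    return ((total - within) / (k - 1)) / (within / (n - k))


# Toy data (hypothetical): one quantitative and one nominal variable per object.
rows = [(1.0, "a"), (1.2, "a"), (5.0, "b"), (5.2, "b")]
good = chf_like_index(rows, [0, 0, 1, 1], quant_idx=[0], nominal_idx=[1])
bad = chf_like_index(rows, [0, 1, 0, 1], quant_idx=[0], nominal_idx=[1])
```

Passing `nominal_measure=entropy` swaps the Gini coefficient for entropy, and calling the function with an empty `quant_idx` corresponds to the purely qualitative case mentioned above. In this toy example the partition that separates the two obvious groups receives the larger value.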

I also evaluated selected indices for determining the number of clusters when objects are characterized by mixed type variables. On the basis of analyses of real data files (from the UCI Machine Learning Repository website: http://archive.ics.uci.edu/ml/datasets.html) I compared three newly proposed indices with the known BIC criterion, which is implemented in two-step cluster analysis in the IBM SPSS Statistics system. The number of object groups was known in advance, and I was interested in the agreement between the optimal number of clusters found and the real number of groups. I analyzed 15 data files and found that the new indices determined the correct number of clusters more successfully than the BIC criterion as implemented in SPSS. Criteria based on the Gini coefficient were more successful than the criterion based on entropy.
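The comparison described above can be sketched as follows: for each data file, take the number of clusters at which an index is optimal and count how often it matches the known number of groups. This is a simplified illustration only. The data-file names and index values are hypothetical toy values, and the rule "pick the extreme value" is an assumption (the two-step procedure in SPSS, for instance, uses a more elaborate rule than a plain BIC minimum).

```python
def best_k(index_by_k, larger_is_better=True):
    """Return the number of clusters k at which the index is optimal."""
    chooser = max if larger_is_better else min
    return chooser(index_by_k, key=index_by_k.get)


def success_rate(results, true_k, larger_is_better=True):
    """Percentage of data files for which the optimal k equals the known number of groups."""
    hits = sum(
        best_k(results[name], larger_is_better) == true_k[name] for name in results
    )
    return 100.0 * hits / len(results)


# Hypothetical index values for two toy data files, keyed by number of clusters.
results = {
    "toy_a": {2: 10.0, 3: 25.0, 4: 18.0},
    "toy_b": {2: 7.0, 3: 5.0, 4: 4.0},
}
true_k = {"toy_a": 3, "toy_b": 3}

rate = success_rate(results, true_k)  # only toy_a is matched
```

For a criterion where smaller values are better, such as BIC, the same helper can be called with `larger_is_better=False`.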

The CHFG index determined the correct number of clusters in most cases (93.33 %). The second most successful criterion was the CHFH index (73.33 %). The BIC criterion determined the correct number of clusters in 40.0 % of cases and my modification of the BIC criterion (using the Gini coefficient instead of entropy, which i

References

Calinski, T., Harabasz, J.: A dendrite method for cluster analysis, Communications in Statistics, Vol. 3, 1974, 1–27.

Gan, G., Ma, C., Wu, J.: Data Clustering: Theory, Algorithms, and Applications. ASA, Philadelphia, 2007.

Halkidi, M., Batistakis, Y., Vazirgiannis, M.: Clustering Algorithms and Validity Measures. SSDBM, Athens, 2001.

ŘEHÁK, J., ŘEHÁKOVÁ, B.: Analýza kategorizovaných dat v sociologii, Academia, Praha, 1986.

ŘEZANKOVÁ, H., HÚSEK, D., LÖSTER, R.: Clustering with Mixed Type Variables and Determination of Cluster Numbers, CNAM and INRIA, Paris, 2010, pp. 1525-1532.

ŘEZANKOVÁ, H., HÚSEK, D., SNÁŠEL, V.: Shluková analýza dat, vydání, Professional Publishing, Praha, 2009.

ŘEZANKOVÁ, H., HÚSEK, D.: Methods for the determination of the number of clusters in statistical software packages, VŠE KSTP; VŠE KMIE, Praha, 2008, pp. 1-6.

UCI Machine Learning Repository: http://archive.ics.uci.edu/ml/datasets.html

Published
2013-12-15