Performance of Rand’s C statistics in clustering analysis: an application to clustering the regions of Turkey

Abstract

Purpose

When a clustering problem is encountered, the researcher must be aware that choosing an incorrect clustering method and distance measure may significantly affect the results of the analysis. The purpose of this study is to determine the best clustering method and distance measure in cluster analysis and to cluster the regions of Turkey on the basis of this result.

Methods

In hierarchical clustering, there are several clustering methods and distance measures, and Rand’s C statistic is one of the best tools for comparing them. Rand’s comparative statistic C takes values between 0.0 and 1.0 inclusive and may be used to compare two clusterings produced by applying clustering methods to a data set with unknown structure, or to assess the performance of a clustering method on a data set with known structure.

Results

In this study, the seven regions of Turkey are clustered with all the clustering methods and distance measures. Based on the social and economic indicators, the final number of clusters is taken as three. Then, according to Rand’s C statistic, all possible pairs of clustering methods are compared for each distance measure in hierarchical clustering, and the results are given in the related tables.

Conclusions

According to the results of all possible comparisons, Ward’s method is found to perform best, and the final clustering of the regions is obtained with Ward’s method.

1 Introduction

The word ‘classification’ can be used in a broad sense to include various types of diagrams that indicate either the relative degrees of similarity or the lines of descent [1]. The term ‘cluster analysis’ encompasses a number of different algorithms and methods for grouping objects of a similar kind into respective categories. Clustering algorithms are often used to find homogeneous subgroups of entities depicted in a set of data [2].

Cluster analysis divides data into groups (clusters) that are meaningful, useful or both. If meaningful groups are the goal, then the clusters should capture the natural structure of the data. The sample characteristics are used to group the samples, and the grouping can be obtained either by partitioning the samples hierarchically or non-hierarchically. Thus, segmentation methods include probability-based grouping of observations as well as cluster-based grouping, the latter comprising hierarchical (tree-based, agglomerative or divisive) and non-hierarchical (partitioning) methods. A good clustering method will produce high-quality clusters with high intra-class similarity and low inter-class similarity. Classes, or conceptually meaningful groups of objects that share common characteristics, play an important role in how people analyze and describe the world [3].

A general question that researchers face in many areas of inquiry is how to organize observed data into meaningful structures, that is, how to develop taxonomies. Cluster analysis is an exploratory data analysis tool that addresses this question by sorting different objects into groups such that the degree of association between two objects is maximal if they belong to the same group and minimal otherwise. Given the above, cluster analysis can be used to discover structures in data without providing an explanation or interpretation; in other words, it simply discovers structures in data without explaining why they exist [4].

As Rand [5] mentioned in his study, many intuitively appealing methods had been suggested for clustering data; however, the interpretation of their results had been hindered by the lack of objective criteria. For this purpose, he developed the C statistic, which depends on a measure of similarity between two different clusterings of the same set of data; the measure essentially considers how each pair of data points is assigned in each clustering.

Rand [5] developed a comparative statistic C, which takes values between 0.0 and 1.0 inclusive and may be used to compare two clusterings produced by applying clustering methods to a data set with unknown structure, or to assess the performance of a clustering method on a data set with known structure. When C is equal to 1.0, there is perfect agreement in the comparison. However, the meaning of a C value between 0.0 and 1.0 is not clear; thus, a means of attaching statistical significance to the values of the C statistic is needed.

Ferreira and Hitchcock [6] compared the performance of four major hierarchical methods (single linkage, complete linkage, average linkage and Ward’s method) for clustering functional data. They used the Rand index to compare the performance of each clustering method.

2 Method

According to Rand [5], the simple computational form of $c$ is as follows: for given $N$ points $X_1, X_2, \ldots, X_N$ and two clusterings of them, $Y = \{Y_1, \ldots, Y_{K_1}\}$ and $Y' = \{Y'_1, \ldots, Y'_{K_2}\}$,

$$ c(Y, Y') = \sum_{i<j}^{N} \gamma_{ij} \Big/ \binom{N}{2}, \qquad (1) $$

where

$$ \gamma_{ij} =
\begin{cases}
1 & \text{if there exist } k \text{ and } k' \text{ such that both } X_i \text{ and } X_j \text{ are in both } Y_k \text{ and } Y'_{k'},\\
1 & \text{if there exist } k \text{ and } k' \text{ such that } X_i \text{ is in both } Y_k \text{ and } Y'_{k'} \text{ while } X_j \text{ is in neither } Y_k \text{ nor } Y'_{k'},\\
0 & \text{otherwise}.
\end{cases} $$
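
As an illustration, a minimal Python sketch of Eq. (1), computed directly from the pairwise agreements, might look as follows; the function name and the label-vector encoding of a clustering are assumptions made for the example, not part of Rand’s paper.

```python
import numpy as np

def rand_c_pairwise(y1, y2):
    """Rand's C from Eq. (1): the fraction of point pairs on which two clusterings agree.

    y1 and y2 are length-N sequences of cluster labels for the same N points.
    """
    y1, y2 = np.asarray(y1), np.asarray(y2)
    n = len(y1)
    agreements = 0
    for i in range(n):
        for j in range(i + 1, n):
            together_1 = y1[i] == y1[j]                  # pair together in clustering Y
            together_2 = y2[i] == y2[j]                  # pair together in clustering Y'
            agreements += int(together_1 == together_2)  # gamma_ij
    return agreements / (n * (n - 1) / 2)                # divide by (N choose 2)
```

For example, `rand_c_pairwise([1, 1, 2, 2, 3, 3, 3], [1, 1, 1, 2, 2, 3, 3])` compares two three-cluster groupings of seven objects.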

For a given pair of clusterings $Y$ and $Y'$ of the same $N$ points, arbitrarily number the clusters in each clustering and let $n_{ij}$ be the number of points simultaneously in the $i$th cluster of $Y$ and the $j$th cluster of $Y'$. Then the similarity between $Y$ and $Y'$ is as follows:

$$ c(Y, Y') = \left[\binom{N}{2} - \left[\frac{1}{2}\left\{\sum_i \Big(\sum_j n_{ij}\Big)^2 + \sum_j \Big(\sum_i n_{ij}\Big)^2\right\} - \sum_{i,j} n_{ij}^2\right]\right] \Big/ \binom{N}{2}. \qquad (2) $$
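
Under the same assumptions, Eq. (2) can be sketched from the $K_1 \times K_2$ contingency table $n_{ij}$; on the same pair of label vectors it returns the same value as the pairwise form above.

```python
import numpy as np

def rand_c_contingency(y1, y2):
    """Rand's C from Eq. (2), computed from the contingency table n_ij."""
    y1, y2 = np.asarray(y1), np.asarray(y2)
    n = len(y1)
    # n_ij: number of points in cluster i of the first clustering and cluster j of the second
    table = np.array([[np.sum((y1 == a) & (y2 == b)) for b in np.unique(y2)]
                      for a in np.unique(y1)])
    total_pairs = n * (n - 1) / 2                      # (N choose 2)
    row_term = (table.sum(axis=1) ** 2).sum()          # sum_i (sum_j n_ij)^2
    col_term = (table.sum(axis=0) ** 2).sum()          # sum_j (sum_i n_ij)^2
    cell_term = (table ** 2).sum()                     # sum_{i,j} n_ij^2
    return (total_pairs - (0.5 * (row_term + col_term) - cell_term)) / total_pairs
```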

In cluster analysis, the seven most commonly used distance measures are the squared Euclidean, city-block, Minkowski, cosine, customized, Pearson correlation and Chebychev distances, and the seven clustering methods are the average, centroid, complete, median, single, Ward and weighted methods.
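
As a sketch of how such method and measure combinations can be generated in practice, the following uses SciPy’s hierarchical clustering; the method names differ from the SPSS-style names used in Tables 2 and 3 (for instance, single linkage corresponds to the ‘Nearest’ method), the data matrix X is a placeholder, and SciPy defines the centroid, median and Ward linkages only for Euclidean geometry.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist

# X: rows = the seven regions, columns = the indicators (placeholder values for illustration)
X = np.random.default_rng(0).random((7, 3))

metrics = ['sqeuclidean', 'cityblock', 'minkowski', 'cosine', 'correlation', 'chebyshev']
methods = ['single', 'complete', 'average', 'weighted']      # purely distance-based linkages

clusterings = {}
for metric in metrics:
    d = pdist(X, metric=metric)                              # condensed pairwise distances
    for method in methods:
        Z = linkage(d, method=method)
        clusterings[(method, metric)] = fcluster(Z, t=3, criterion='maxclust')

# Centroid, median and Ward linkages assume Euclidean geometry,
# so the raw observations are passed with the Euclidean metric.
for method in ['centroid', 'median', 'ward']:
    Z = linkage(X, method=method, metric='euclidean')
    clusterings[(method, 'euclidean')] = fcluster(Z, t=3, criterion='maxclust')
```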

2.1 Data set

To apply and see the results of Rand’s C statistic, the regions of Turkey, with their life expectancy, education and income indices, are considered as an illustrative example. The provinces of Turkey are organized into seven census-defined regions, which were originally defined at the First Geography Congress in 1941 [7]. They are CA: Central Anatolia, M: Mediterranean, A: Aegean, MA: Marmara, EA: Eastern Anatolia, SEA: Southeastern Anatolia, BS: Black Sea. The human development index is calculated from the life expectancy, education and income indices, which express three common characteristics of the regions, and it is accepted as an important criterion for determining the development levels of countries [8]. The related index values for each region of Turkey are given in Table 1 (Source: [9]).

Table 1 Life expectancy, education, income and human development indexes for the regions of Turkey

In hierarchical cluster analysis, the last step of the hierarchy leaves only two clusters. Because it is hard to see the efficiencies of the distance measures and clustering methods with only two clusters, the final number of clusters is taken as three, and whether the regions join the same cluster or not is observed for all clustering methods and distance measures. Table 2 shows the results of these analyses. For each comparison there are 21 region pairs, obtained from all possible combinations of the seven regions taken two at a time. Then all the clustering methods are compared pairwise; for each comparison, the numbers of region pairs that are together in both clusterings, separate in both clusterings, or mixed are counted, and Rand’s C statistic is calculated. The results are given in Table 3. Because there are seven clustering methods, there are 21 possible pairwise comparisons of methods, and with seven distance measures this yields the 147 results given in Table 3.

Table 2 Clustering results of regions according to all distance measures and clustering methods
Table 3 Comparisons of clustering methods and Rand’s C statistics
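
A rough analogue of the comparisons summarised in Table 3 can be sketched as follows, reusing the clusterings dictionary and the rand_c_contingency function from the earlier sketches; it is not an exact replica of the 7 × 7 design because of the Euclidean restriction noted above.

```python
from itertools import combinations

# Group the three-cluster solutions by distance measure, then compare every pair of
# clustering methods within a measure by Rand's C.
by_metric = {}
for (method, metric), labels in clusterings.items():
    by_metric.setdefault(metric, {})[method] = labels

rand_scores = {}
for metric, solutions in by_metric.items():
    for m1, m2 in combinations(sorted(solutions), 2):
        rand_scores[(metric, m1, m2)] = rand_c_contingency(solutions[m1], solutions[m2])

# e.g. agreement of average and complete linkage under the city-block distance:
# print(rand_scores[('cityblock', 'average', 'complete')])
```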

3 Results and discussion

According to Table 3, when all the clustering methods and distance measures are examined, the numbers of mixed pairs range up to seven. Correspondingly, the Rand’s C statistics, which show the agreement between the compared clusterings, are 0.67, 0.76, 0.81, 0.86, 0.95 and 1, respectively.

When the distance measure is ‘Squared Euclidean’, the Nearest clustering method is the worst of all. If the distance measure is ‘Cosine’, the Nearest and Within clustering methods perform worse than the other methods. When ‘Pearson correlation’ is considered as the distance measure in hierarchical clustering analysis, the Between clustering method is not as suitable as the other clustering methods. As can be seen from Table 2, when the distance measure is ‘Minkowski’, the results of the Nearest and Centroid clustering methods according to Rand’s statistic are worse than those of all other methods. The Nearest clustering method also shows the worst performance for the ‘Block’ distance measure. If the distance measure is ‘Customized’, Rand’s C statistics show that the Nearest and Centroid clustering methods give the worst performances.

For the ‘Chebychev’ distance measure, the clustering results vary across all clustering methods in at least one comparison, so it is hard to say that any of the clustering methods performs better than the others.

In view of the results mentioned above, Ward’s hierarchical clustering method is applied to the data set, and the results of the analysis are given in Figure 1.

Figure 1 Dendrogram according to Ward’s clustering method.
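
A dendrogram of this kind can be reproduced with a short sketch along the following lines, again assuming a 7 × 3 matrix X holding the index values of Table 1 (the placeholder X from the earlier sketch is reused here).

```python
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram, linkage

regions = ['CA', 'M', 'A', 'MA', 'EA', 'SEA', 'BS']   # the seven regions listed in Section 2.1
Z = linkage(X, method='ward')                         # Ward's linkage on the index values
dendrogram(Z, labels=regions)
plt.title("Dendrogram according to Ward's method")
plt.ylabel('Distance')
plt.tight_layout()
plt.show()
```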

According to the dendrogram given in Figure 1, at the first stage of the analysis the Central Anatolia and Black Sea regions join the same cluster, while the Mediterranean, Aegean and Marmara regions join the other cluster; these two clusters connect to each other at the third stage. At the second stage, the Eastern Anatolia and Southeastern Anatolia regions join the same cluster, and they combine with the other two clusters at the final stage according to Ward’s hierarchical clustering method.

4 Conclusion

Earlier studies comparing clustering methods also confirm the results of this study. For example, Kuiper and Fisher [10] compared six hierarchical clustering procedures using the Rand statistic, and according to their results, Ward’s method was the best of all. Blashfield [11] used Cohen’s statistic to measure the accuracy of clustering methods, and according to his results, Ward’s method performed significantly better than the other clustering procedures. Hands and Everitt [12] also compared five hierarchical clustering techniques and found that Ward’s method was better overall than the other hierarchical methods. According to Milligan and Cooper [13], Ward’s method gave the best overall recovery. Ferreira and Hitchcock [6] compared the performance of four major hierarchical methods according to Rand’s criterion, and as a result, Ward’s method was usually the best.

When there is a clustering problem, the researcher must be aware that choosing a wrong clustering method and distance measure may significantly affect the results of the analysis. Based on all the results given in the related tables of this study, to obtain better results according to Rand’s C statistic in a hierarchical clustering analysis, one can consider applying Ward’s or the Median clustering method and avoid the Nearest clustering method for all distance measures.

References

  1. Rohlf FJ: Methods of comparing classifications. Ann. Rev. Ecolog. Syst. 1974, 5: 101–103. 10.1146/annurev.es.05.110174.000533

  2. Tarpey T: Clustering functional data. J. Classif. 2003, 20: 93–114. 10.1007/s00357-003-0007-3

  3. Tan P, Steinbach M, Kumar V: Introduction to Data Mining. Addison-Wesley, Reading; 2005.

  4. Hill T, Lewicki P: STATISTICS: Methods and Applications. StatSoft, Tulsa; 2007.

  5. Rand WM: Objective criteria for the evaluation of clustering methods. J. Am. Stat. Assoc. 1971, 66: 846–850. 10.1080/01621459.1971.10482356

  6. Ferreira L, Hitchcock DB: A comparison of hierarchical methods for clustering functional data. Commun. Stat., Simul. Comput. 2009, 38: 1925–1949. 10.1080/03610910903168603

  7. Yiğit A: The studies of dividing Turkey into regions: an assessment for the past and the future developments. IV Ulusal Coğrafya Sempozyumu, Avrupa Birliği Sürecindeki Türkiye’de Bölgesel Farklılıklar 2006, 33–44.

  8. Saraçlı S, Yılmaz V, Kaygısız Z: Examining the geographical dispersion of the human development index in Turkey with multivariabled statistical techniques. 3 Ulusal Bilgi Ekonomi ve Yönetim Kongresi, Osmangazi University 2004.

  9. UNDP: Human Development Report. www.undp.org/hdro (2000)

  10. Kuiper FK, Fisher LA: A Monte Carlo comparison of six clustering procedures. Biometrics 1975, 31: 777–783. 10.2307/2529565

  11. Blashfield RK: Mixture model tests of cluster analysis: accuracy of four agglomerative hierarchical methods. Psychol. Bull. 1976, 83: 377–388.

  12. Hands S, Everitt B: A Monte Carlo study of the recovery of cluster structure in binary data by hierarchical clustering techniques. Multivar. Behav. Res. 1987, 22: 235–243. 10.1207/s15327906mbr2202_6

  13. Milligan GW, Cooper MC: A study of standardization of variables in cluster analysis. J. Classif. 1988, 5: 181–204. 10.1007/BF01897163

Acknowledgements

Dedicated to Professor Hari M Srivastava.

I would like to thank Professor Dr. İsmet Doğan for his support and all his statistical help. He is a lecturer at Afyon Kocatepe University, Faculty of Medicine, Department of Biostatistics, Afyonkarahisar, Turkey.

Author information

Corresponding author

Correspondence to Sinan Saraçli.

Additional information

Competing interests

The author declares that he has no competing interests.

Rights and permissions

Open Access This article is distributed under the terms of the Creative Commons Attribution 2.0 International License (https://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Cite this article

Saraçli, S. Performance of Rand’s C statistics in clustering analysis: an application to clustering the regions of Turkey. J Inequal Appl 2013, 142 (2013). https://doi.org/10.1186/1029-242X-2013-142
