Abstract: The purpose of this paper is to introduce a new clustering methodology. The paper is divided into three parts. In the first part we develop the axiomatic theory for the average silhouette width (ASW) index. There are different ways to investigate the quality and characteristics of clustering methods, such as validation indices applied in simulations and real data experiments, model-based theory, and non-model-based theory known as axiomatic theory. In this work we not only take the empirical approach of validating clustering results through simulations, but also focus on the development of the axiomatic theory. In the second part we present a novel clustering methodology based on optimization of the ASW index. We consider the problem of estimating the number of clusters and finding the clustering for this number simultaneously. Two algorithms are proposed and evaluated against several partitioning and hierarchical clustering methods. An intensive empirical comparison of different distance metrics across the various clustering methods is conducted. In the third part we consider two application domains: novel single-cell RNA sequencing datasets, and rainfall data used to cluster weather stations.
Abstract: In this paper, we propose a unified clustering approach that estimates the number of clusters and produces the clustering for this number simultaneously. Average silhouette width (ASW) is a widely used standard cluster quality index. We define a distance-based objective function that optimizes ASW for clustering. The proposed algorithm, named OSil, needs only the data observations as input, without any prior knowledge of the number of clusters. This work is a thorough investigation of the proposed methodology, its usefulness, and its limitations. A wide spectrum of clustering structures was generated, and several well-known clustering methods, including partitioning, hierarchical, density-based, and spatial methods, were considered as competitors of the proposed methodology. Simulations show that the OSil algorithm achieves better clustering quality than all other clustering methods included in the study. OSil can find well-separated, compact clusters and estimates the number of clusters better than several methods. Apart from the proposal of the new methodology and its investigation, this paper offers a systematic analysis of indices for estimating the number of clusters, some of which have never appeared together in a comparative simulation setup before. The study offers many insightful findings useful for the selection of clustering methods and indices.
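To make the ASW index mentioned above concrete, the following is a minimal sketch of how it can be computed for a given partition, assuming Euclidean distance and the usual convention that a point in a singleton cluster contributes a silhouette of 0. The function name `average_silhouette_width` and the toy data are illustrative assumptions; this is not the OSil algorithm itself, only the quantity that such an algorithm would seek to maximize over partitions.

```python
from math import dist


def average_silhouette_width(points, labels):
    """Average silhouette width (ASW) of a partition, Euclidean distance."""
    # Group point indices by cluster label.
    clusters = {}
    for idx, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(idx)
    if len(clusters) < 2:
        raise ValueError("ASW is only defined for two or more clusters")

    total = 0.0
    for i, lab in enumerate(labels):
        own = clusters[lab]
        if len(own) == 1:
            continue  # assumed convention: singleton silhouette is 0
        # a: mean distance to the other members of the point's own cluster.
        a = sum(dist(points[i], points[j]) for j in own if j != i) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster.
        b = min(
            sum(dist(points[i], points[j]) for j in members) / len(members)
            for other, members in clusters.items()
            if other != lab
        )
        denom = max(a, b)
        if denom > 0:
            total += (b - a) / denom
    return total / len(points)


# Illustrative toy data (hypothetical): two well-separated groups of three points.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
good_labels = [0, 0, 0, 1, 1, 1]   # matches the visible grouping
bad_labels = [0, 1, 0, 1, 0, 1]    # mixes the two groups

asw_good = average_silhouette_width(points, good_labels)
asw_bad = average_silhouette_width(points, bad_labels)
```

On this toy data the partition that matches the two visible groups scores a higher ASW than the mixed one, which is the sense in which ASW can serve as an objective for choosing both the partition and the number of clusters.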
Abstract: An agglomerative hierarchical clustering (AHC) framework and algorithm named HOSil is proposed, based on a new linkage metric that optimizes the average silhouette width (ASW) index. A thorough investigation of various clustering methods and estimation indices is conducted across a diverse variety of data structures with three aims: a) clustering quality, b) clustering recovery, and c) estimation of the number of clusters. HOSil shows better clustering quality for a range of artificial and real-world data structures compared to k-means, PAM, single, complete, average, Ward, and McQuitty linkage, spectral and model-based clustering, and several estimation methods. It can identify clusters of various shapes, including spherical clusters, elongated clusters, and relatively small clusters, as well as clusters drawn from different distributions, including uniform, t, gamma, and others. HOSil recovers the correct number of clusters well. For some data structures, only HOSil was able to identify the correct number of clusters.