This study addresses the task of clustering multi-label image collections, which is increasingly important in fields such as forensics, social media, and intelligence. Traditional classification models fall short in real-world scenarios where labeled data may not be available; unsupervised clustering offers a way forward in such cases. A clustering of multi-label data should minimize the number of clusters an analyst must inspect to identify all instances of a specific label (cluster efficiency), while also reducing misplaced data within each cluster (cluster quality). Existing clustering algorithms applied to multi-label image collections generally emphasize either cluster efficiency or cluster quality. We propose a Post-Clustering Merging algorithm that can be applied to the results of existing clustering algorithms and provides greater control over the balance between cluster efficiency and cluster quality in multi-label image collections. We also introduce two external metrics designed for multi-label clustering: the Pairwise Jaccard Similarity Score and the Label Distribution Score. These metrics enable a nuanced evaluation of clustering quality and efficiency, respectively, in scenarios where single-label metrics are inadequate. We demonstrate the effectiveness of the merging algorithm on various multi-label image collections. The results indicate significant improvements: the method not only gives more control but also reduces the trade-off between cluster quality and efficiency. This study fills a gap in multi-label data collection analysis and sets a foundation for future exploration in this domain.

• Two novel metrics allow evaluation of multi-label image collection clustering.
• Clusters are merged based on similarities and dissimilarities to increase performance.
• Our method proves effective on several multi-label image collections.
• Our merge method with k-means outperforms state-of-the-art deep clustering.
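The abstract names the two evaluation goals but does not give formulas for the metrics. As a rough illustration only, the sketch below shows one plausible reading of each idea: quality as the mean Jaccard similarity between the label sets of items that share a cluster, and efficiency as the number of clusters that contain a given label. The function names (`jaccard`, `mean_pairwise_jaccard`, `clusters_needed`) and the exact aggregation are assumptions made for this example; they are not the paper's definitions of the Pairwise Jaccard Similarity Score or the Label Distribution Score.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard similarity of two label sets; two empty sets count as identical."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        return 1.0
    return len(a & b) / len(union)


def mean_pairwise_jaccard(label_sets, assignments):
    """Assumed quality proxy: average Jaccard similarity over all
    pairs of items that were placed in the same cluster."""
    clusters = {}
    for labels, cluster_id in zip(label_sets, assignments):
        clusters.setdefault(cluster_id, []).append(labels)

    total, pairs = 0.0, 0
    for members in clusters.values():
        for a, b in combinations(members, 2):
            total += jaccard(a, b)
            pairs += 1
    # With no same-cluster pairs (all singletons) there is nothing to average.
    return total / pairs if pairs else 0.0


def clusters_needed(label_sets, assignments, label):
    """Assumed efficiency proxy: number of distinct clusters an analyst
    must inspect to see every item carrying `label`."""
    return len({c for labels, c in zip(label_sets, assignments) if label in labels})


# Toy collection: four images, two clusters (illustrative data only).
label_sets = [{"cat", "dog"}, {"cat"}, {"car"}, {"car", "dog"}]
assignments = [0, 0, 1, 1]

quality = mean_pairwise_jaccard(label_sets, assignments)
dog_clusters = clusters_needed(label_sets, assignments, "dog")
cat_clusters = clusters_needed(label_sets, assignments, "cat")
```

In this toy case the label "dog" is spread over both clusters while "cat" sits in one, which is the kind of tension between efficiency and quality that merging clusters after the initial clustering is meant to manage.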
Published in: Expert Systems with Applications
Volume 288, Article 127875