Scalable k-means++

2012 · 628 citations · Journal Article

Authors

Benjamin Moseley · University of Illinois Urbana-Champaign

Andrea Vattani · University of California, San Diego

Sergei Vassilvitskii · Yahoo! Research

Abstract

Over half a century old and showing no signs of aging, k-means remains one of the most popular data processing algorithms. As is well known, a proper initialization of k-means is crucial for obtaining a good final solution. The recently proposed k-means++ initialization algorithm achieves this, obtaining an initial set of centers that is provably close to the optimum solution. A major downside of k-means++ is its inherently sequential nature, which limits its applicability to massive data: one must make k passes over the data to find a good initial set of centers. In this work we show how to drastically reduce the number of passes needed to obtain, in parallel, a good initialization. This is unlike prevailing efforts on parallelizing k-means, which have mostly focused on the post-initialization phases of k-means. We prove that our proposed initialization algorithm, k-means||, obtains a nearly optimal solution after a logarithmic number of passes, and then show that in practice a constant number of passes suffices. Experimental evaluation on real-world large-scale data demonstrates that k-means|| outperforms k-means++ in both sequential and parallel settings.
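
The contrast the abstract draws between the two initializations can be illustrated with a minimal Python sketch. It is not the authors' implementation. The abstract does not spell out the procedure, so the details here are assumptions based on the commonly described form of the method: an oversampling factor (the hypothetical parameter `oversample`, defaulting to 2k), a fixed small number of `rounds`, and a final step that reweights the candidates and reduces them to k centers with weighted k-means++. All function names are illustrative.

```python
import random

def sq_dist(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def cost_to_centers(points, centers):
    """Squared distance from each point to its nearest center (one pass over the data)."""
    return [min(sq_dist(p, c) for c in centers) for p in points]

def kmeans_pp_init(points, k, rng, weights=None):
    """k-means++ seeding: every new center needs a fresh pass, so k passes in total."""
    weights = weights or [1.0] * len(points)
    centers = [rng.choices(points, weights=weights, k=1)[0]]
    while len(centers) < k:
        d2 = cost_to_centers(points, centers)
        probs = [w * d for w, d in zip(weights, d2)]
        if sum(probs) == 0:  # fewer distinct points than k; stop early
            break
        # Sample the next center with probability proportional to weight * squared distance.
        centers.append(rng.choices(points, weights=probs, k=1)[0])
    return centers

def kmeans_parallel_init(points, k, rng, oversample=None, rounds=5):
    """k-means||-style seeding (assumed form): few passes, many candidates per pass."""
    l = oversample if oversample is not None else 2 * k  # assumed default
    centers = [rng.choice(points)]
    for _ in range(rounds):
        d2 = cost_to_centers(points, centers)
        total = sum(d2)
        if total == 0:
            break
        # Each point is kept independently of the others, so this step
        # could be split across data partitions and run in parallel.
        centers += [p for p, d in zip(points, d2)
                    if rng.random() < min(1.0, l * d / total)]
    # Weight each candidate by the number of points closest to it.
    counts = [0] * len(centers)
    for p in points:
        nearest = min(range(len(centers)), key=lambda i: sq_dist(p, centers[i]))
        counts[nearest] += 1
    # Reduce the small weighted candidate set to k centers sequentially.
    return kmeans_pp_init(centers, k, rng, weights=[float(c) for c in counts])
```

In this sketch the number of passes over the full data is `rounds + 1` regardless of k, whereas `kmeans_pp_init` makes one pass per center. The final reduction runs only on the candidate set, which is much smaller than the input.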

Topics & Keywords

Data Management and Algorithms · Data Stream Mining Techniques · Anomaly Detection Techniques and Applications

Publication Details

Published in: Proceedings of the VLDB Endowment

Volume 5, Issue 7, pp. 622-633

DOI: 10.14778/2180912.2180915

Field-Weighted Citation Impact: 21.40
