
CMUCS00105
Computer Science Department
School of Computer Science, Carnegie Mellon University
CMUCS00105
Accelerating Exact kmeans Algorithms with Geometric
Reasoning
Dan Pelleg, Andrew Moore
January 2000
CMUCS00105.ps
CMUCS00105.pdf
Keywords: Computational geometry, classification, denisity
estimation, kdtrees, clustering, Kmeans
We present new algorithms for the kmeans clustering problem. They use the
kdtree data structure to reduce the large number of nearestneighbor
queries issued by the traditional algorithm. Sufficient statistics are
stored in the nodes of the kdtree. Then, an analysis of the geometry of
the current cluster centers results in great reduction of the work needed to
update the centers. Our algorithms behave exactly as the traditional
kmeans algorithm. Proofs of correctness are included. The kdtree can
also be used to initialize the kmeans starting centers efficiently. Our
algorithms can be easily extended to provide fast ways of computing the
error of a given cluster assignment, regardless of the method in which those
clusters were obtained. We also show how to use them in a setting which
allows approximate clustering results, with the benefit of running faster.
We have implemented and tested our algorithms on both real and simulated
data. Results show a speedup factor of up to 170 on real astrophysical
data, and superiority over the naive algorithm on simulated data in up to
5 dimensions. Our algorithms scale well with respect to the number of
points and number of centers, allowing for clustering with tens of
thousands of centers.
23 pages
