|   | CMU-CS-00-105 Computer Science Department
 School of Computer Science, Carnegie Mellon University
 
    
     
 CMU-CS-00-105
 
Accelerating Exact k-means Algorithms with Geometric 
Reasoning 
Dan Pelleg, Andrew Moore 
January 2000  
CMU-CS-00-105.psCMU-CS-00-105.pdf
 Keywords: Computational geometry, classification, denisity
estimation, kd-trees, clustering, K-means
 We present new algorithms for the k-means clustering problem. They use the
kd-tree data structure to reduce the large number of nearest-neighbor
queries issued by the traditional algorithm. Sufficient statistics are
stored in the nodes of the kd-tree. Then, an analysis of the geometry of
the current cluster centers results in great reduction of the work needed to
update the centers. Our algorithms behave exactly as the traditional
k-means algorithm. Proofs of correctness are included. The kd-tree can
also be used to initialize the k-means starting centers efficiently. Our
algorithms can be easily extended to provide fast ways of computing the
error of a given cluster assignment, regardless of the method in which those
clusters were obtained. We also show how to use them in a setting which
allows approximate clustering results, with the benefit of running faster.
 
We have implemented and tested our algorithms on both real and simulated
data. Results show a speedup factor of up to 170 on real astrophysical
data, and superiority over the naive algorithm on simulated data in up to
5 dimensions. Our algorithms scale well with respect to the number of
points and number of centers, allowing for clustering with tens of
thousands of centers. 
23 pages 
 |