CMU-ML-08-117
Machine Learning Department
School of Computer Science, Carnegie Mellon University



CMU-ML-08-117

HADI: Fast Diameter Estimation and Mining
in Massive Graphs with Hadoop

U Kang, Charalampos Tsourakakis, Ana Paula Appel*,
Christos Faloutsos, Jure Leskovec**

December 2008

CMU-ML-08-117.pdf


Keywords: Diameter, graph, hadoop


How can we quickly find the diameter of a petabyte-sized graph? Large graphs are ubiquitous: social networks (Facebook, LinkedIn, etc.), the World Wide Web, biological networks, computer networks and many more. The size of graphs of interest has been increasing rapidly in recent years and with it also the need for algorithms that can handle tera- and peta-byte graphs. A promising direction for coping with such sizes is the emerging map/reduce architecture and its open-source implementation, 'Hadoop'. Estimating the diameter of a graph, as well as the radius of each node, is a valuable operation that can help us spot outliers and anomalies. We propose HADI (HAdoop based DIameter estimator), a carefully designed algorithm to compute the diameters of petabyte-scale graphs. We run the algorithm to analyze the largest public web graph ever analyzed, with billions of nodes and edges. Additional contributions include the following: (a) We propose several performance optimizations (b) we achieve excellent scale-up, and (c) we report interesting observations including outliers and related patterns, on this real graph (116Gb), as well as several other real, smaller graphs. One of the observations is that the Albert et al. conjecture about the diameter of the web is over-pessimistic.

26 pages

*ICMC - USP São Carlos, Brazil
**Computer Science Department, Cornell/Stanford University


SCS Technical Report Collection
School of Computer Science homepage

This page maintained by reports@cs.cmu.edu