CMU-CS-17-110
Computer Science Department
School of Computer Science, Carnegie Mellon University



CMU-CS-17-110

Communities and Anomaly Detection in
Large Edge-Labeled Graphs

Miguel Araújo

May 2017

Ph.D. Thesis

CMU-CS-17-110.pdf


Keywords: Graph Mining, Time-Evolving Networks, Community Detection, Anomaly Detection, Tensor Factorization

The identification of anomalies and communities of nodes in real-world graphs has applications in widespread domains, from the automatic categorization of wikipedia articles or websites to bank fraud detection. While recent and ongoing research is supplying tools for the analysis of simple unlabeled data, it is still a challenge to find patterns and anomalies in large labeled datasets such as time evolving networks. What do real communities identified in big networks look like? How can we find sources of infections in bipartite networks? Can we predict who is most likely to join an online discussion on a given topic?

We model interaction networks using appropriate matrix and tensor representations in order to gain insights into these questions. We incorporate edge attributes, such as timestamps in phone-call networks or airline codes in flight networks, and complex node side-information, such as who-retweets-whom in order to understand who uses a given hashtag on Twitter. We provide three major contributions:

  1. Hyperbolic communities: Our exploration of real communities provides novel theoretical ideas regarding their structure, and we show how typical long-tail degree distributions can be leveraged to create efficient algorithms on problems that seem to suffer from quadratic explosion.
  2. Anomaly detection: Novel distributed algorithms that identify problematic nodes whose influence can only be detected on their neighbors, validated through the analysis of data breaches in bank transactions.
  3. Forecasting: New techniques that forecast network evolution by incorporating contextual side-information and the evolution of independent community structures.
Leveraging these techniques, we explore massive datasets such as networks with billions of credit card transactions, Twitter graphs with over 300 million interactions and phone-call networks with over 50 million phone-calls.

194 pages

Thesis Committee:
Christos Faloutsos (Co-Chair)
William Cohen
Aarti Singh
Pedro Ribeiro (Co-Chair, University of Porto)
Tina Eliassi-Rad (Northeastern University)
Beatriz Santos (University of Aveiro)
Alexandre Francisco (University of Lisbon)

Frank Pfenning, Head, Computer Science Department
Andrew W. Moore, Dean, School of Computer Science




Return to: SCS Technical Report Collection
School of Computer Science

This page maintained by reports@cs.cmu.edu