Computer Science Department
School of Computer Science, Carnegie Mellon University


Anomaly Detection in Large Social Graphs

Neil Shah

October 2017

Ph.D. Thesis


Keywords: Fraud detection, anomaly detection, graph mining, data mining, machine learning, user behavior, clustering, network science, social science, social network analysis, plain graphs, dynamic graphs, rich graphs, unsupervised learning, scalability, interpretability, fBox, TimeCrunch, FLOCK, EdgeCentric, M3A, DeltaCon-Attr, ConDeNSe

Given the ever-growing prevalence of online social services, leveraging massive datasets has become an increasingly important challenge for businesses and end-users alike. Online services capture a wealth of information about user behavior and platform interactions, such as who-follows-whom relationships in social networks and who-rates-what-and-when relationships in e-commerce networks. Since many of these services rely on data-driven algorithms to recommend content to their users, authenticity of user behavior is paramount to success. But given anonymity on the internet, how do we know which users and actions are real or not? This thesis focuses on this problem and introduces new techniques to effectively and efficiently discern anomalous and fraudulent behavior in online social graphs. Specifically, we work on three thrusts: plain graphs, dynamic graphs and rich graphs.

Firstly, we focus on plain graphs, in which only static connectivity information is known. We detail several proposed algorithms spanning the topics of spectral fraud detection in a single graph, blame attribution between graph snapshots, and structurally diverse graph summarization. Our fBox algorithm in [SBGF14] identifies link fraudsters in social networks with over 93% precision and identifies hundreds of thousands of fake accounts, many of which were yet unsuspended.

Next, we broaden our scope to dynamic graphs, in which we leverage connectivity information over a span of time. Many online interactions are timestamped, and thus time and interarrival time between user actions are powerful features which can be used to discern abnormal behavior. We describe multiple relevant works which describe how to identify and summarize anomalous temporal graph structures, model interarrival time patterns in user queries to find anomalous search behavior, and identify “group” anomalies comprising of users acting in lockstep. Our FLOCK approach in [Sha17] is the first to tackle the viewbot problem on livestreaming platforms, and finds astroturfed broadcasts and views with over 90% precision and near-perfect recall.

Lastly, we expand our reach to rich graphs, in which connectivity information is supplemented by other attributes, such as time, rating, number of messages sent, etc. Rich graphs are common in practice, as online services routinely track many aspects of user behavior to gain multifaceted insights. Multimodal views of data are useful in identifying various types of anomalies in different subspaces. To this end, we propose works which focus on ranking anomalies in edge-attributed graphs, and characterizing multimodality of online link fraud. Our EdgeCentric approach in [SBH+16] uncovers rating patterns in e-commerce datasets and pinpoints fake reviewers with 87% precision at Flipkart.

The techniques described in this thesis span various disciplines including data mining, machine learning, network and social sciences and information theory and are practically applicable to a number of real-world fraud and general anomaly detection scenarios. They are carefully designed to attain high precision and recall in practice and scale to massive datasets, including social networks, telecommunication networks, e-commerce and collaboration networks with up to millions of nodes and billions of edges.

245 pages

Thesis Committee:
Christos Faloutsos (Chair)
Roni Rosenfeld
Jaime Carbonell
Jiawei Han (University of Illinois at Urbana-Champaign)

Frank Pfenning, Head, Computer Science Department
Andrew W. Moore, Dean, School of Computer Science

Return to: SCS Technical Report Collection
School of Computer Science

This page maintained by