CMU-CS-07-149
Computer Science Department
School of Computer Science, Carnegie Mellon University




Incremental Pattern Discovery on
Streams, Graphs and Tensors

Jimeng Sun

December 2007

Ph.D. Thesis

CMU-CS-07-149.pdf


Keywords: Data mining, stream mining, incremental learning, clustering, tensor

Incremental pattern discovery targets streaming applications where data arrive continuously. The questions are: how to find patterns (main trends) incrementally, how to efficiently update the old patterns when new data arrive, and how to use the patterns to solve other problems such as anomaly detection.

As examples, 1) a sensor network monitors a large number of distributed streams (such as temperature and humidity); 2) network forensics monitors Internet communication patterns to identify attacks; 3) cluster monitoring examines the system behaviors of a number of machines for potential failures; 4) social network analysis monitors a dynamic graph for communities and abnormal individuals; 5) financial fraud detection tries to find fraudulent activities among a large number of transactions.

We first investigate a powerful data model, the tensor stream (TS), in which there is one tensor per timestamp. To capture diverse data formats, we have a zero-order TS for a single time series (e.g., the stock price of Google over time), a first-order TS for multiple time series (sensor measurement streams), a second-order TS for a matrix (graphs), and a high-order TS for a multi-array (Internet communication network, source-destination-port). Second, we develop different online algorithms on TS: 1) the centralized and distributed SPIRIT for mining a first-order TS, as well as its extensions for the local correlation function and privacy preservation; 2) the compact matrix decomposition (CMD) and GraphScope for a second-order TS; 3) the dynamic tensor analysis (DTA), streaming tensor analysis (STA) and window-based tensor analysis (WTA) for a high-order TS. All the techniques are extensively evaluated on real applications such as network forensics and cluster monitoring.
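
The tensor stream model described above can be pictured with a small sketch. The following is only an illustration under stated assumptions, not code from the thesis: tensors are modelled as plain nested Python lists, the helper names `tensor_order` and `stream_order` are hypothetical, and all sample values are made up. It shows how the order of a stream is determined by the order of the tensor that arrives at each timestamp.

```python
from typing import Iterable, Union

# Assumption: a tensor is a number (order 0) or a regular nested list.
Tensor = Union[float, int, list]


def tensor_order(t: Tensor) -> int:
    """Return the order (number of modes) of a nested-list tensor."""
    order = 0
    while isinstance(t, list):
        if not t:
            raise ValueError("empty mode: order is undefined")
        order += 1
        t = t[0]  # assumes a regular (non-ragged) array
    return order


def stream_order(stream: Iterable[Tensor]) -> int:
    """Return the common order of all tensors in a stream.

    Raises ValueError if the stream is empty or mixes orders.
    """
    orders = {tensor_order(t) for t in stream}
    if len(orders) != 1:
        raise ValueError(f"expected one common order, got {sorted(orders)}")
    return orders.pop()


# One tensor per timestamp; every value below is invented for illustration.
price_stream = [101.5, 102.0, 101.8]                      # zero-order: one value per timestamp
sensor_stream = [[21.0, 0.40], [21.2, 0.41]]              # first-order: one vector per timestamp
graph_stream = [[[0, 1], [1, 0]], [[0, 0], [1, 0]]]       # second-order: one adjacency matrix per timestamp
traffic_stream = [[[[1, 0], [0, 2]], [[0, 0], [3, 0]]]]   # third-order: source x destination x port
```

Calling `stream_order` on each sample stream gives the order of the corresponding TS in the text: a single time series, multiple time series, a graph, and a source-destination-port multi-array. A practical system would more likely use a dense or sparse array library than nested lists; lists are used here only to keep the sketch self-contained.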

In particular, CMD achieves orders of magnitude improvements in space and time over the previous state of the art, and identifies interesting anomalies. GraphScope detects interesting communities and change-points on several time-evolving graphs, such as the Enron email graph and a network traffic flow graph. DTA, STA and WTA are all online methods for higher-order data that scale well with time and offer fundamental tradeoffs against one another; they have also been applied to a number of applications, such as social network community tracking, anomaly detection in data centers and network traffic monitoring.

201 pages

