CMU-CS-16-124
Computer Science Department
School of Computer Science, Carnegie Mellon University




Mining Large Multi-Aspect Data: Algorithms and Applications

Evangelos E. Papalexakis

August 2016

Ph.D. Thesis



Keywords: Data mining, multi-aspect, multi-modal, matrix, tensor, tensor decomposition, tensor factorization, PARAFAC, Tucker, CORCONDIA, scalability, unsupervised learning, unsupervised analysis, exploratory analysis, brain data analysis, social network analysis, web search, ParCube, Turbo-SMT, Paracomp, AutoTen

What does a person's brain activity look like when they read the word apple? How does it differ from the activity of the same person (or even a different person) when reading about an airplane? How can we identify parts of the human brain that are active for different semantic concepts? In a seemingly unrelated setting, how can we model and mine the knowledge on the web (e.g., subject-verb-object triplets), in order to find hidden emerging patterns? Our proposed answer to both problems (and many more) is through bridging signal processing and large-scale multi-aspect data mining.

Specifically, language in the brain, along with many other real-world processes and phenomena, has different aspects, such as the various semantic stimuli of the brain activity (apple or airplane), the particular person whose activity we analyze, and the measurement technique. In the above example, the brain regions with high activation for "apple" will likely differ from the ones for "airplane". Nevertheless, each aspect of the activity is a signal of the same underlying physical phenomenon: language understanding in the human brain. Taking into account all aspects of brain activity results in more accurate models that can drive scientific discovery (e.g., identifying semantically coherent brain regions).

In addition to the above Neurosemantics application, multi-aspect data appear in numerous scenarios. One is mining knowledge on the web, where different aspects in the data include entities in a knowledge base and the links between them, or search engine results for those entities. Another is multi-aspect graph mining, for example multi-view social networks, where we observe social interactions of people under different means of communication and use all aspects of the communication to extract communities more accurately.

The main thesis of our work is that many real-world problems, such as the aforementioned, benefit from jointly modeling and analyzing the multi-aspect data associated with the underlying phenomenon we seek to uncover. In this thesis, we develop scalable and interpretable algorithms for mining big multi-aspect data, with emphasis on tensor decomposition. We present algorithmic advances in scaling up and parallelizing tensor decomposition and in assessing the quality of its results, which have enabled the analysis of multi-aspect data that the state-of-the-art could not support. Indicatively, our proposed methods speed up the state-of-the-art by up to two orders of magnitude, and are able to assess the quality for tensors 100 times larger. Furthermore, we present results on multi-aspect data applications focusing on Neurosemantics and on Social Networks and the Web, demonstrating the effectiveness of multi-aspect modeling and mining. We conclude with our future vision on bridging Signal Processing and Data Science for real-world applications.

278 pages

Thesis Committee:
Christos Faloutsos (Chair)
Tom Mitchell
Jeff Schneider
Nicholas D. Sidiropoulos (University of Minnesota)

Frank Pfenning, Head, Computer Science Department
Andrew W. Moore, Dean, School of Computer Science


