|
CMU-CS-25-119 Computer Science Department School of Computer Science, Carnegie Mellon University
Explainable Mining of Graphs and Time Series: Meng-Chieh Lee Ph.D. Thesis June 2025
Given a social network graph, how can we predict connections between users and determine whether they are based on shared hobbies or common friends? Given a database containing molecular graphs, how can we determine whether the graphs inhibit HIV replication based on substructures they frequently share? Similarly, in time series data from EEG recording, how can we identify seizures and explain why they are considered abnormal? Although recent machine learning methods have shown improved performance, many remain black-box models, making explainability challenging. This leads us to explainable artificial intelligence (XAI), which offers valuable insights through its explanations and is more practical for deployment in real-world applications. In this thesis, we focus on developing explainable machine learning methods tailored for graphs and time series. Each method we propose is either inherently explainable, or designed to automatically provide data analysis or justification for its decisions. In each part, we present effective and general algorithms, and explore a broad range of applications. In the first part, we focus on node-level graph mining. We propose algorithms to analyze various types of information in graphs, e.g., the network effects of the graph structure, and the usable information in the node features. Our proposed linear methods are not only inherently interpretable and fast, but also outperform baselines in solving node classification and link prediction tasks. In node classification, our method improves the accuracy by 10.3% over the second-best baseline, while being 2.5 times faster. In link prediction, our method achieves an average rank 1.1, outperforming baselines on 11 out of 12 real-world datasets. In the application of graph retrieval-augmented generation, our agentic method achieves an average relative improvement of 51%. In the second part, we focus on graph-level graph mining. We discover frequent substructures using the minimum description length (MDL) principle and learnable graph kernels. In graph anomaly detection, our MDL-based method is 58 times faster than the second-best baseline, while achieving 1.3 times higher average precision. In graph regression, our method with learnable graph kernels improves the mean absolute error by 14.3%. In the application of human trafficking detection, our method detects human trafficking ads with 84% precision, while requiring only 8 hours to process 4 million documents. In the third part, we focus on time series mining, with an emphasis on time series anomaly detection. Our self-supervised method effectively identifies the ground truth hyperparameters of anomalies in time series data, resulting in an average rank 2.2 compared to the baselines. In the application involving medical EEG signals, unlike traditional point anomaly detection methods, we focus on identifying group anomalies that occur within a short period and exhibit similar abnormal patterns. Our method is fast and scalable, and discovers and ranks both point and group anomalies in 2 minutes for 1 million data points on a stock machine. 257 pages
Thesis Committee:
Srinivasan Seshan, Head, Computer Science Department
|
Return to:
SCS Technical Report Collection This page maintained by reports@cs.cmu.edu |