CMU-CS-25-103
Computer Science Department
School of Computer Science, Carnegie Mellon University

Event Monitoring in Modern Public Health Data Streams
Ananya Joshi
Ph.D. Thesis
March 2025
Detecting individual data sequences corresponding to actionable events in large-scale, dynamic data streams, also known as data monitoring, is a challenging computational problem with applications across multiple domains. In public health specifically, these data sequences can correspond to events such as outbreaks or data quality issues that directly affect downstream decision-making and outbreak response efforts. However, as the volume of public health-related data continues to grow, traditional machine learning algorithms for anomaly or event detection, designed for smaller datasets, become increasingly ineffective, for example by outputting tens of thousands of uninformative alerts that lead to reviewer fatigue. These challenges are exacerbated by the noise, non-stationarity, and incompleteness of public health data, and they hinder the ability of domain experts to perform data monitoring. My thesis enables domain experts to monitor large-scale data streams via novel ranked-list-based algorithms that address the question, "Which data should be examined first, and why?" In contrast to traditional approaches that use statistical alerts, the output list of top-ranked data prioritizes data reviewers' attention so that they remain engaged with the algorithmic outputs. The underlying algorithms, designed to be simple, scalable, and generalizable, include (1) ranking outliers from limited-history, nonstationary, noisy data streams with weekday effects, (2) reranking extreme outlier data points across large streams, and (3) ranking top anomalous subsequences of any length from dynamic, partially observed data without sampling. Evaluations of these algorithms and the overall approach in offline and deployed settings show strong results.
For instance, when paired with custom user interfaces, the approach enabled a 53-fold increase in monitoring efficiency for data reviewers at the Delphi Group at Carnegie Mellon University over more than two years, allowing them to detect over 200 noteworthy data issues from 15 million new data points each week. This monitoring approach directly supports efficient and accurate public health surveillance and can readily be deployed at the state, national, or international level to enhance the effectiveness of data-driven public health decision-making. The core algorithms may also be relevant to other critical monitoring domains.

103 pages
Thesis Committee:
Srinivasan Seshan, Head, Computer Science Department