CMU-HCII-25-102
Human-Computer Interaction Institute
School of Computer Science, Carnegie Mellon University



Interactive Data Profiling

Will Epperson

June 2025

Ph.D. Thesis

CMU-HCII-25-102.pdf


Keywords: Data Science Tools, Data Visualization, Artificial Intelligence, Machine Learning, Interactive Data Science, Exploratory Data Analysis, Data Profiling, Data Quality, Tabular Datasets, Text Datasets


Data has been a key driver behind recent advances in science, engineering, and artificial intelligence. As datasets have grown larger and more complex, the primary bottleneck has shifted from access to data toward the human effort required to interpret it. Human expertise is essential for understanding datasets; however, generating this understanding during analysis remains a time-consuming, manual process. Many AI modeling failures are, at their core, data problems: issues that might have been addressed earlier with better tools for understanding the data. Data visualization facilitates understanding through visual representations; however, existing approaches to visual data exploration introduce friction that slows users down, requiring them either to define charts and interactions manually in code or to switch context to a separate analysis tool. How can we build flexible and lightweight systems to help people understand their data more quickly?

This thesis develops systems for Interactive Data Profiling that accelerate data exploration through a fast feedback loop between interactive interfaces and data programming workflows. We first motivate this problem through a large-scale interview study and survey of data scientists, which reveal the potential for tools to help users manage the repetitive code used for data profiling. We then discuss the design, implementation, and evaluation of three systems that develop the approach of interactive data profiling. First, we describe AUTOPROFILER, a system that augments programming environments with automatic data profiles that summarize the data in memory and update as the user programs. We then extend this approach with SOLAS, which tracks the history of a user's analysis code to create data profiles adapted to the current task and the user's interests. User evaluations demonstrate how the lightweight visualizations and fast feedback loops enabled by these systems help users quickly identify important patterns and data quality issues. Finally, we present TEXTURE, a general-purpose text exploration tool that enables users to iterate on attributes for describing their text and then explore the results in an interactive UI. Expert user studies show how TEXTURE enables more efficient exploration and helps users uncover new insights from their text datasets.

Together, these tools establish how to situate interactive data profiling within data science workflows to enable a fast feedback loop between manipulating data and inspecting the results. As data becomes an increasingly important component of modern work, interactive data profiling systems can play a critical role in enabling faster, more reliable understanding of the data behind models and decisions.

120 pages

Thesis Committee:
Adam Perer (Co-Chair)
Dominik Moritz (Co-Chair, HCII/Apple)
Sherry Tongshuang Wu
Aniket Kittur
Gagan Bansal (Microsoft Research)

Brad A. Myers, Head, Human-Computer Interaction Institute
Martial Hebert, Dean, School of Computer Science


