Computer Science Department
School of Computer Science, Carnegie Mellon University


Supporting Hybrid Workloads for In-Memory Database Management
Systems via a Universal Columnar Storage Format

Tianyu Li

M.S. Thesis

May 2019


Keywords: Database Systems, Apache Arrow

The proliferation of modern data processing ecosystems has given rise to open-source columnar data formats. The key advantage of these formats is that they allow organizations to load data from database management systems (DBMSs) once instead of having to convert it to a new format for each usage. These formats, however, are read-only. This means that organizations must still use a heavy-weight transformation process to load data from their original format into the desired columnar format. We aim to reduce or even eliminate this process by developing an in-memory storage management architecture for transactional DBMSs that is aware of the eventual usage of its data and operates directly on columnar storage blocks. We introduce relaxations to common analytical format requirements so that data can be updated efficiently, and rely on a lightweight in-memory transformation process to convert blocks back to analytical forms when they are cold. We also describe how third-party analytical tools can access data directly with minimal serialization overhead. To evaluate our work, we implemented our storage engine based on the Apache Arrow format and integrated it into the CMDB DBMS. Our experiments show that our approach achieves performance comparable to dedicated OLTP DBMSs while also enabling orders-of-magnitude faster data exports to external data science and machine learning libraries than existing approaches.

74 pages

Thesis Committee:
Andrew Pavlo (Chair)
David G. Andersen

Srinivasan Seshan, Head, Computer Science Department
Tom M. Mitchell, Interim Dean, School of Computer Science
