CMU-CS-03-159 Computer Science Department
 School of Computer Science, Carnegie Mellon University
 
    
     
 
AutoPart: Automating Schema Design for Large Scientific Databases Using Data Partitioning
 
Efstratios Papadomanolakis, Anastassia Ailamaki 
July 2003  
CMU-CS-03-159.ps, CMU-CS-03-159.pdf
 Keywords: Relational databases, performance, self-tuning,
vertical partitioning
 Database applications that use multi-terabyte datasets are becoming
increasingly important for scientific fields such as astronomy and biology.
Scientific databases are particularly suited for the application of
automated physical design techniques, because of their data volume and the
complexity of the scientific workloads. Current automated physical design
tools focus on the selection of indexes and materialized views. In
large-scale scientific databases, however, the data volume and the
continuous insertion of new data allow for only limited indexes and
materialized views. By contrast, data partitioning does not replicate data,
thereby reducing space requirements and minimizing update overhead. In this
paper we propose AutoPart, an algorithm that automatically partitions
database tables to optimize sequential access assuming prior knowledge of a
representative workload. The resulting schema is indexed using a fraction of
the space required for indexing the original schema. To evaluate AutoPart,
we build an automated schema design tool that interfaces to commercial
database systems. We experiment with AutoPart in the context of the Sloan
Digital Sky Survey database, a real-world astronomical database, running on
SQL Server 2000. Our experiments corroborate the benefits of partitioning
for large-scale systems: Partitioning alone improves query execution
performance by a factor of two on average. Combined with indexes, the new
schema also outperforms the indexed original schema by 20% (for queries) and
a factor of five (for updates), while using only half the original index
space.
 
15 pages 