COMPUTER SCIENCE TECHNICAL REPORT ABSTRACTS

CMU-CS-18-102
Computer Science Department
School of Computer Science, Carnegie Mellon University

CMU-CS-18-102

Methods for Reducing Unnecessary Computation
on False Mappings in Read Mapping

Hongyi Xin

Ph.D. Thesis

February 2018

Keywords: Read Mapping, Approximate String Matching, Seed-and-Extend, Landau-Vishkin, Error Tolerance, String Algorithms, Suffix Trie, Edit Distance, Affine Gap

Advancements in sequencing technology have brought a large increase in the quantity of raw sequencing data. To cope with the ever growing sequence data sets, efficient read mappers are developed to accurately reconstruct entire genomes from fragments, by mapping fragments of the genome of an organism to a high quality reference genome. For each fragment, a mapper tries to find a distinct corresponding highly similar sequence in the reference, or multiple highly similar sequences if applicable. Despite improvements of the mapping software, modern state-of-the-art mappers struggle between speed and sensitivity: as a mapper increases sensitivity to tolerate more sequencer errors and genetic variations, it inevitably enlarges the search space and spend more time comparing fragments against increasingly dissimilar reference sequences. Comparisons over such false mappings are often unnecessary and wasteful, as dissimilar reference sequences are not considered as meaningful mappings. Furthermore, although only a small fraction of all DNA fragments require high mapping sensitivity due to large number of embedded errors, mappers have to raise sensitivity against all fragments, as the error profile of each fragment is unknown to the mapper.

In this dissertation, we provide multiple algorithms and implementations that aim to reduce the amount of unnecessary computation spent on false mappings while achieving high sensitivity by 1) quickly and accurately filtering out false mappings and 2) reducing the total number of false mappings without sacrificing sensitivity through improved seeding mechanisms. Specifically, we designed SIMD-friendly algorithms that quickly identify false mappings with high accuracy. We also extended a previously proposed approximate string matching algorithm to better suit biological applications. Finally, we developed multiple methods that enhance the popular seed-and-extend mapping strategy by increasing seed selectivity without reducing mapping sensitivity. From our experiments, we showcase inefficiencies of naïve seeding mechanisms in state-of-the-art mappers. We show that our filters can achieve high filtering accuracies while spending only a fraction of the computational cost. We further show that our improved seed selection methods are highly effective in reducing total number of false mappings. With these algorithms and methods, we provide a set of tools available for future read mappers to be more precise when mapping reads to the reference.

Thesis Committee:
Carl Kingsford (Chair)
Jian Ma
Phillip Gibbons
Iman Hajirasouliha (Weill Cornell Medical College)
Bill Bolosky (Microsoft Research)

Frank Pfenning, Head, Computer Science Department
Andrew W. Moore, Dean, School of Computer Science

128 pages

Return to: SCS Technical Report Collection
School of Computer Science

This page maintained by reports@cs.cmu.edu