LANE CENTER SCIENCE TECHNICAL REPORT ABSTRACTS

CMU-CB-11-104
Lane Center for Computational Biology
School of Computer Science, Carnegie Mellon University

Modeling the Space of Subcellular Location Patterns
using Images and other Sources of Information

Luis Pedro Coelho

September 2011

Ph.D. Thesis

Keywords: Subcellular proteomics, bioimage informatics, topic models, semi-supervised learning, local image features, cell nucleus segmentation, fluorescent microscopy.

Protein location is one of the major areas of interest in the study of proteins. It can be studied one protein at a time, or systematically in a high-throughput fashion, an approach that has been called location proteomics.

Subcellular location can either be predicted from the protein sequence, homology, or other circumstantial evidence such as interaction patterns, or determined by direct observation.

The prediction approach has the advantage that it requires less data (sometimes only the sequence). On the other hand, its results are not as conclusive as those obtained from direct data. Furthermore, prediction is, at least with the most widely used techniques, obtained from static data (sequence, functional annotations, binding patterns, ...). Thus, most systems will predict the same location regardless of cell type or cell state.

Direct data is normally in the form of images of fluorescently labeled proteins. The automatic analysis of such images now has a decade-long history. Most of the work has been done in the supervised learning mode: the researcher specifies a set of interesting location classes (corresponding to the organelles of interest), finds a few examples of each, and trains a classifier to recognise them in unlabeled data. Some work has also applied unsupervised methods to the problem. In this approach, the different proteins are clustered together into a hierarchy or a set of groups.

This work shows that direct and indirect data can be combined into a single model, and that inferences can be made that depend on all of the data. In particular, the model can project multiple modalities into the same space and return a label that is based on all of its input data.

I will also propose new image representations for use with subcellular location images. They are based on Speeded-Up Robust Features (SURF), but adapted to the setting where, in addition to the protein channel, a reference channel (in the case under study, a DNA marker) is present. I will use supervised classification as a validation problem and show that SURF outperforms traditional approaches and that adding DNA information outperforms traditional SURF.

140 pages
