CMU-CS-15-120
Computer Science Department
School of Computer Science, Carnegie Mellon University



Grounded Knowledge Bases for Scientific Domains

Dana Movshovitz-Attias

August 2015

Ph.D. Thesis


Keywords: Grounded language learning, natural language processing, knowledge base construction, knowledge representation, statistical language modeling, unsupervised learning, semi-supervised learning, bootstrapping, topic modeling, machine learning, probabilistic graphical models, ontology, grounding, information extraction.

This thesis focuses on building knowledge bases (KBs) for scientific domains. Specifically, we create structured representations of technical-domain information using unsupervised or semi-supervised learning methods. This work is inspired by recent advances in knowledge base construction based on Web text. However, in the technical domains we consider here, in addition to text corpora we have access to the objects named by text entities, as well as data associated with those objects. For example, in the software domain, we consider the implementation of classes in code repositories, and observe how they are used in programs. In the biomedical domain, biological ontologies define interactions and relations between domain entities, and there is experimental information on entities such as proteins and genes. We consider the process of grounding, namely, linking entity mentions from text to external domain resources, including code repositories and biomedical ontologies, where objects can be uniquely identified. Grounding presents an opportunity to learn not only how entities are discussed in text, but also what their real-world properties are.

The main contribution of this thesis is in addressing challenges from the following research areas, in the context of learning about technical domains: (1) Knowledge representation: How should knowledge about technical domains be represented and used? (2) Grounding: How can existing resources of technical domains be used in learning? (3) Applications: What applications can benefit from structured knowledge bases dedicated to scientific data?

We explore grounded learning and knowledge base construction for the biomedical and software domains. We first discuss approaches for improving applications based on well-studied statistical language models. Next, we construct a deeper semantic representation of domain entities by building a grounded ontology, where entities are linked to a code repository, and through an adaptation of an ontology-driven KB learner to scientific input. Finally, we present a topic model framework for knowledge base construction, which jointly optimizes the KB schema and learned facts, and show that this framework produces high-precision KBs in our two domains of interest. We discuss extensions to our model that allow, first, incorporating human input, leading to a semi-supervised learning process, and second, grounding the modeled entities with domain data.

147 pages

Thesis Committee:
William W. Cohen (Chair)
Tom Mitchell
Roni Rosenfeld
Alon Halevy (Google Research)

Frank Pfenning, Head, Computer Science Department
Andrew W. Moore, Dean, School of Computer Science


