CMU-ML-08-107
Machine Learning Department
School of Computer Science, Carnegie Mellon University




Data Analysis Project: Leveraging Massive
Textual Corpora Using n-Gram Statistics

Andrew Carlson, Tom M. Mitchell, Ian Fette*

May 2008



Keywords: Machine learning, spelling correction, ontology population, large data sets


We study methods of efficiently leveraging massive textual corpora through n-gram statistics. Specifically, we explore algorithms that use a database of frequency counts for sequences of tokens in a teraword Web corpus to correct spelling mistakes and to extract a list of instances of some category given only the name of the target category. For spelling correction, we use a novel correction algorithm and demonstrate high accuracy in correcting both real-word errors and non-word errors. For category extraction, we show promising preliminary results for a variety of categories. We conclude that n-gram statistics provide an efficient way to exploit the information contained in a massive corpus of text with relatively simple algorithms. The report ends with a reflection on problems encountered, possible solutions, and future work.
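
As a rough illustration of how n-gram frequency counts can drive real-word spelling correction, the following is a minimal sketch and not the algorithm evaluated in the report. The count table, the confusion sets, and the function names (NGRAM_COUNTS, CONFUSION_SETS, context_score, correct) are all hypothetical stand-ins; a real system would query a Web-scale count database rather than an in-memory dictionary.

```python
from collections import Counter

# Hypothetical toy trigram counts standing in for a Web-scale n-gram database.
NGRAM_COUNTS = Counter({
    ("a", "piece", "of"): 900,
    ("piece", "of", "cake"): 700,
    ("a", "peace", "of"): 5,
    ("peace", "of", "cake"): 1,
    ("peace", "of", "mind"): 800,
    ("piece", "of", "mind"): 40,
})

# Hypothetical confusion sets: words that are commonly mistaken for each other.
CONFUSION_SETS = {
    "piece": {"piece", "peace"},
    "peace": {"piece", "peace"},
}


def context_score(tokens, i, candidate, n=3):
    """Sum the counts of every n-gram that covers position i
    after substituting `candidate` for the token at i."""
    replaced = tokens[:i] + [candidate] + tokens[i + 1:]
    first = max(0, i - n + 1)
    last = min(i, len(replaced) - n)
    total = 0
    for start in range(first, last + 1):
        total += NGRAM_COUNTS[tuple(replaced[start:start + n])]
    return total


def correct(tokens):
    """Replace a token only when another member of its confusion set
    has strictly more n-gram support in the surrounding context."""
    out = list(tokens)
    for i, tok in enumerate(tokens):
        candidates = CONFUSION_SETS.get(tok)
        if not candidates:
            continue
        # Sorting makes tie-breaking deterministic.
        best = max(sorted(candidates), key=lambda c: context_score(tokens, i, c))
        if context_score(tokens, i, best) > context_score(tokens, i, tok):
            out[i] = best
    return out


fixed = correct(["a", "peace", "of", "cake"])
unchanged = correct(["peace", "of", "mind"])
```

In this sketch a word is changed only when an alternative has strictly more support from the surrounding n-grams, so a correctly used word such as "peace" in "peace of mind" is left alone. Handling non-word errors would additionally require generating candidates from a dictionary, which is omitted here.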

31 pages

*Institute for Software Research, Carnegie Mellon University

