MACHINE LEARNING TECHNICAL REPORT ABSTRACTS

CMU-ML-08-107
Machine Learning Department
School of Computer Science, Carnegie Mellon University

Data Analysis Project: Leveraging Massive Textual Corpora Using n-Gram Statistics

Andrew Carlson, Tom M. Mitchell, Ian Fette*

May 2008

CMU-ML-08-107.pdf

Keywords: Machine learning, spelling correction, ontology population, large data sets

We study methods of efficiently leveraging massive textual corpora through n-gram statistics. Specifically, we explore algorithms that use a database of frequency counts for sequences of tokens in a teraword Web corpus to correct spelling mistakes and to extract a list of instances of some category given only the name of the target category. For spelling correction, we use a novel correction algorithm and demonstrate high accuracy in correcting both real-word errors and non-word errors. For category extraction, we show promising preliminary results for a variety of categories. We conclude that n-gram statistics provide an efficient way to use information contained in a massive corpus of text using relatively simple algorithms. The report ends with a reflection on problems encountered, possible solutions, and future work.

31 pages

*Institute for Software Research, Carnegie Mellon University
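To make the general idea of context-sensitive correction from n-gram counts concrete, here is a minimal Python sketch. It is not the correction algorithm from the report, which this abstract does not specify. The trigram table, the confusion set, and the function names (`candidates`, `context_score`, `correct`) are all hypothetical, and the counts are made-up toy values standing in for a Web-scale frequency database.

```python
from collections import Counter

# Toy trigram counts (invented numbers) standing in for a database of
# frequency counts over a very large corpus.
NGRAM_COUNTS = Counter({
    ("a", "piece", "of"): 900,
    ("piece", "of", "cake"): 400,
    ("a", "peace", "of"): 5,
    ("peace", "of", "cake"): 1,
    ("peace", "of", "mind"): 700,
    ("piece", "of", "mind"): 30,
})

# Hypothetical confusion sets: words that are easily mistaken for each other.
CONFUSION_SETS = [{"piece", "peace"}]


def candidates(word):
    """Return the word itself first, then its confusable alternatives."""
    for group in CONFUSION_SETS:
        if word in group:
            return [word] + sorted(group - {word})
    return [word]


def context_score(tokens, i, candidate, n=3):
    """Sum the counts of every n-gram covering position i, with the candidate substituted."""
    trial = tokens[:i] + [candidate] + tokens[i + 1:]
    total = 0
    for start in range(max(0, i - n + 1), min(i, len(trial) - n) + 1):
        total += NGRAM_COUNTS[tuple(trial[start:start + n])]
    return total


def correct(tokens):
    """Replace each token with its best-supported candidate; ties keep the original word."""
    out = list(tokens)
    for i, word in enumerate(tokens):
        out[i] = max(candidates(word), key=lambda c: context_score(out, i, c))
    return out


print(correct(["a", "peace", "of", "cake"]))
print(correct(["peace", "of", "mind"]))
```

Because the original word is listed first among the candidates, a word with no supporting evidence either way is left unchanged. This toy handles only real-word confusions; non-word errors would additionally need a candidate generator (for example, edit-distance neighbours), which is omitted here.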
