CMU-ISR-08-131R
Institute for Software Research
 School of Computer Science, Carnegie Mellon University
 
    
     
 
Looking under the Hood of Stochastic Machine Learning Algorithms for Parts of Speech Tagging
 
Jana Diesner, Kathleen M. Carley 
July 2008  
Center for the Computational Analysis of Social and Organizational Systems (CASOS) Technical Report
 
Supersedes CMU-ISR-08-131

CMU-ISR-08-131R.pdf

Keywords: Part of speech tagging, hidden Markov models, Viterbi algorithm, AutoMap

 A variety of Natural Language Processing and Information Extraction tasks, 
such as question answering and named entity recognition, can benefit from 
precise knowledge about a word's syntactic category or Part of Speech (POS) 
(Church, 1988; Rabiner, 1989; Stolz, Tannenbaum, & Carstensen, 1965). POS 
taggers are widely used to assign a single best POS to every word in text data, 
with stochastic approaches achieving accuracy rates of 96% to 97% 
(Jurafsky & Martin, 2000). When building a POS tagger, developers 
need to make a number of design decisions, some of which 
significantly impact the accuracy and other performance aspects of the 
resulting engine. However, the documentation of POS taggers often leaves these 
decisions implicit. In this paper we provide an overview of 
some of these decisions and empirically determine their impact on POS tagging 
accuracy. The insights gained can be valuable to people who 
want to design, implement, modify, fine-tune, integrate, or responsibly use 
a POS tagger. We applied the results presented herein when building a POS 
tagger and integrating it into AutoMap, a tool that facilitates 
relation extraction from texts, both as a stand-alone feature and as an 
auxiliary feature for other tasks.
 
35 pages 