CMU-ISR-17-109
Institute for Software Research
School of Computer Science, Carnegie Mellon University
The Utility of Corporate Comparison
Geoffrey P. Morgan
July 2017
Center for the Computational Analysis of Social and Organizational Systems
Delete lists are lists of words that have been determined to carry little useful meaning for textual analysis. One frequently deleted subset of words is stop words. Stop words are textual tokens, such as "and", "a", or "the", that contribute structurally or grammatically to a sentence but carry little inherent meaning of their own. Identifying stop words is a routine step in most text-cleaning applications, but it is frequently done via user-maintained word lists. I suggest that the corpora comparison technique I devised for word-score polarization can be used to identify low-value words while preserving the bulk of the text tokens. I use both known and random-draw corpora comparisons for this process. By "known" corpora, I mean corpora drawn from explicit data sources: for example, the emails of one company and the emails of another. "Random-draw" corpora are created by drawing document sets at random, so this technique could be applied to any sufficiently large text corpus of interest. I use the ability to identify stop words as a proxy for performance in generating useful delete lists. Random-draw and known corpora comparison techniques outperform an iteration of TF-IDF (Term Frequency - Inverse Document Frequency), which performs quite poorly on this email data.
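The idea of comparing two corpora to surface low-value words can be illustrated with a minimal Python sketch. This is not the report's scoring method, which the abstract does not specify. It assumes a simple stand-in score, (a - b) / (a + b), where a and b are a word's relative frequencies in the two corpora, and it treats words scoring near zero as delete-list candidates. All function names, the whitespace tokenizer, and the 0.2 threshold are hypothetical.

```python
import random
from collections import Counter


def relative_frequencies(docs):
    """Share of all tokens in the corpus accounted for by each word."""
    # Assumption: lowercase whitespace tokenization is good enough for a sketch.
    counts = Counter(tok for doc in docs for tok in doc.lower().split())
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}


def polarization_scores(corpus_a, corpus_b):
    """Score each word in [-1, 1]: +1 only in A, -1 only in B, 0 balanced.

    Assumption: this ratio is a stand-in, not the report's actual formula.
    """
    freq_a = relative_frequencies(corpus_a)
    freq_b = relative_frequencies(corpus_b)
    scores = {}
    for word in set(freq_a) | set(freq_b):
        a = freq_a.get(word, 0.0)
        b = freq_b.get(word, 0.0)
        scores[word] = (a - b) / (a + b)  # a + b > 0 for every observed word
    return scores


def random_draw_split(docs, seed=0):
    """Build two 'random-draw' corpora by shuffling and halving one corpus."""
    rng = random.Random(seed)
    shuffled = list(docs)
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return shuffled[:half], shuffled[half:]


def delete_list_candidates(corpus_a, corpus_b, threshold=0.2):
    """Words used about equally in both corpora (hypothetical threshold)."""
    scores = polarization_scores(corpus_a, corpus_b)
    return sorted(w for w, s in scores.items() if abs(s) <= threshold)


# "Known" corpora: two tiny made-up document sets standing in for two companies.
company_a = ["the merger and the audit", "the audit of the ledger"]
company_b = ["the pipeline and the drilling", "the drilling of the well"]
print(delete_list_candidates(company_a, company_b))  # ['and', 'of', 'the']
```

For the random-draw variant, the same comparison would be run on the two halves returned by `random_draw_split(all_docs)`. On a real corpus a single random split is noisy, so repeating the draw and keeping words that stay near zero across draws is one plausible refinement, again an assumption rather than the report's procedure.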
16 pages