|   | CMU-ISRI-04-107 Institute for Software Research International
 School of Computer Science, Carnegie Mellon University
 
    
     
 CMU-ISRI-04-107
 
Learning Semantically Robust Rules from Data 
Yiheng Li, Latanya Sweeney 
February 2004  
Also appears asCenter for Automated Learning and Discovery
 Technical Report CMU-CALD-04-100
 
Also associated with the Data Privacy Laboratory 
CMU-ISRI-04-107.pdf Keywords: Association rules, rule learning, hierarchical learning, data mining
 We introduce the problem of mining robust rules, which are expressive 
multi-dimensional generalized association rules.  Consider a large 
relational table, where associated with each attribute is a hierarchy 
whose base values are those originally represented in the data, and 
values appearing at higher levels in the hierarchy represent increasingly 
more general concepts of base values.  Attribute hierarchies provide 
meaningful levels of concept aggregation, such as the encoding of postal 
codes (ZIP) or dates, or the taxonomy of products.  We find the least 
general rules formed by combining mixed levels of generalizations across 
attributes to convey the maximum expression of information supported by 
attribute hierarchies, parameter settings and data tuples.  We term these
"robust rules"and introduce a GenTree algorithm as a means to learn 
robust rules from a table.  An example of a robust rule from a table 
having base values {5-digit ZIP, gender, registration date (year/month/day),
party} might be "women living in Cambridge (021**) and registered in the 
1970s (197*/xx/xx) tend to be Democrats." Previous studies on mining 
generalized association rules have been limited dimensionally (e.g., 
transactional data), by data type (e.g., quantitative data), and/or to 
rules expressed from either fixed-level or non-semantic abstractions. 
Such approaches limit the kinds of rules that can be learned. 
Experiments using GenTree with two real-world datasets, containing 
10,000 six-attributed tuples and over 4,000 eight-attributed tuples 
each, show that learned rules convey more comprehensive information 
than possible with traditional association rule mining algorithms, 
because traditional approaches limit the expressivity of the rules 
they generate.
 
20 pages 
 |