CMU-CS-97-175
Computer Science Department
School of Computer Science, Carnegie Mellon University
 
Predicting Data Cache Misses in Non-Numeric 
Applications Through Correlation Profiling 
Todd C. Mowry, Chi-Keung Luk* 
September 1997  
An abbreviated version of this paper will appear in the Proceedings of the Fourth International Symposium on 
High-Performance Computer Architecture, February 1-4, 1998.
 
CMU-CS-97-175.ps

Keywords: Cache memories, performance of systems (measurement
techniques, performance attributes), data structures (graphs, lists, trees),
compilers

 Software-based latency tolerance techniques offer the potential for bridging
the ever-increasing speed gap between the memory subsystem and today's
high-performance processors. However, to fully exploit the benefit of these
techniques, one must be careful to apply them only to the dynamic references
that are likely to suffer cache misses --- otherwise the runtime overheads
can potentially offset any gains. In this paper, we focus on isolating 
dynamic miss instances in non-numeric applications, which is a difficult
but important problem. Although compilers cannot statically analyze data 
locality in non-numeric applications, one viable approach is to use profiling 
information to measure the actual miss behavior. Unfortunately, the
state of the art in cache miss profiling (which we call summary profiling)
is inadequate for references with intermediate miss ratios: it either
misses opportunities to hide latency or inserts unnecessary overhead.
To overcome this problem, we propose and evaluate a new profiling
technique that helps predict which dynamic instances of a static memory 
reference will hit or miss in the cache: correlation profiling.
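
The difference between summary profiling and self correlation can be illustrated with a small sketch. This is not the report's implementation: the function names, the two-outcome history length, and the trace are assumptions chosen for illustration. A summary profile reduces a static reference to one miss ratio, while a self-correlation profile conditions the miss ratio on that reference's own recent outcomes.

```python
from collections import defaultdict

def summary_profile(outcomes):
    """Single miss ratio for one static reference (1 = miss, 0 = hit)."""
    return sum(outcomes) / len(outcomes)

def self_correlation_profile(outcomes, history_len=2):
    """Miss ratio conditioned on the reference's previous outcomes.

    history_len is an illustrative assumption, not a value from the report.
    """
    counts = defaultdict(lambda: [0, 0])  # history -> [misses, total]
    for i in range(history_len, len(outcomes)):
        history = tuple(outcomes[i - history_len:i])
        counts[history][0] += outcomes[i]
        counts[history][1] += 1
    return {h: m / t for h, (m, t) in counts.items()}

# Hypothetical trace: the reference misses on every third execution.
trace = [1, 0, 0] * 4
summary = summary_profile(trace)
by_history = self_correlation_profile(trace)
```

On this made-up trace the summary miss ratio is intermediate (one third), which gives no guidance on when to prefetch. Conditioning on the last two outcomes separates the dynamic instances cleanly: after two hits the reference always misses, and after any other history it always hits.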
 
Our experimental results demonstrate that roughly half of the 22 non-numeric
applications we study can potentially enjoy significant reductions in memory
stall time by exploiting at least one of the three forms of correlation
profiling we consider: control-flow correlation, self correlation, and 
global correlation. In addition, our detailed case studies illustrate that
self correlation succeeds because a given reference's cache outcomes often 
contain repeated patterns, and control-flow correlation succeeds because cache
outcomes are often call-chain dependent. We also demonstrate that software 
prefetching can achieve better performance on a modern superscalar processor 
when directed by correlation profiling rather than summary profiling 
information. 
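
Control-flow correlation can be sketched in the same spirit. The call chains, function names, and counts below are hypothetical and serve only to show how one static reference might have call-chain-dependent cache outcomes.

```python
from collections import defaultdict

def control_flow_profile(events):
    """Miss ratio of one static reference, split by call-chain context.

    events: list of (call_chain, missed) pairs, where call_chain is a
    tuple of (hypothetical) caller names and missed is True on a miss.
    """
    counts = defaultdict(lambda: [0, 0])  # call chain -> [misses, total]
    for chain, missed in events:
        counts[chain][0] += int(missed)
        counts[chain][1] += 1
    return {chain: m / t for chain, (m, t) in counts.items()}

# Hypothetical data: the same load inside lookup() reached via two paths.
events = (
    [(("main", "build_tree", "lookup"), True)] * 8
    + [(("main", "build_tree", "lookup"), False)] * 2
    + [(("main", "query", "lookup"), False)] * 10
)
ratios = control_flow_profile(events)
overall = sum(missed for _, missed in events) / len(events)
```

In this invented example the overall miss ratio is 0.4, but the per-chain ratios are 0.8 and 0.0, so a prefetch would be worth inserting only on the path through `build_tree`.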
26 pages 
*Department of Computer Science, University of Toronto, 
Toronto, Ontario, Canada, M5S 3G4