COMPUTER SCIENCE TECHNICAL REPORT ABSTRACTS

CMU-CS-17-112
Computer Science Department
School of Computer Science, Carnegie Mellon University

CMU-CS-17-112

Modeling and Analyzing Systems with Redundancy

Kristen Gardner

May 2017

Ph.D. Thesis

Keywords: Redundancy, replication, task assignment, dispatching, scheduling, queueing theory, stochastic processes, Markov models, RIQ, Redundancy-d, LRF, PF

Reducing latency is a primary concern in computer systems. As cloud computing and resource sharing become more prevalent, the problem of how to reduce latency becomes more challenging because there is a high degree of variability in server speeds. Recent computer systems research has shown that the same job can take 12 times or even 27 times longer to run on one machine than another, due to varying background load, garbage collection, network contention, and other factors. This server variability is transient and unpredictable, making it hard to know how long a job will take to run on any given server, and therefore how best to dispatch and schedule jobs.

An increasingly popular strategy for combating server variability is redundancy. The idea is to create multiple copies of the same job, dispatch these copies to different servers, and wait for the first copy to complete service. A great deal of empirical computer systems research has demonstrated the benefits of redundancy: using redundancy can yield up to a 50% reduction in mean response time. Unfortunately, there is very little theoretical work analyzing performance in systems with redundancy.

This thesis presents the first exact analysis of response time in systems with redundancy. We begin in the Independent Runtimes (IR) model, in which a job's service times (runtimes) are assumed to be independent across servers. Here we derive exact expressions for the distribution of response time in a certain set of classbased redundancy systems. We also propose two new scheduling policies, Least Redundant First (LRF) and Primaries First (PF), and prove that LRF minimizes overall system response time, while PF is fair across classes of jobs with different redundancy degrees.

While the IR model is appropriate in certain settings, in others it does not make sense because the independence assumption eliminates any notion of an "inherent job size." The IR model leads to the conclusion that more redundancy is always better, which often is not true in practice. Therefore we propose the S&X model, which is the first model to decouple a job's inherent size (X) from the server slowdown (S). This model is important because, unlike prior models, it allows a job's runtimes to be correlated across servers. The S&X model makes it evident that redundancy does not always help: in fact, too much redundancy can lead to instability. To overcome this, we propose a new dispatching policy, Redundant-to-Idle-Queue, which is provably stable in the S&X model, while offering substantial response time improvements compared to systems without redundancy

170 pages

Thesis Committee:
Mor Harchol-Balter (Chair)
Guy Blelloch
Anupam Gupta
Alan Scheller-Wolf (Tepper)
Rhonda Righter (University of California, Berkeley)

Frank Pfenning, Head, Computer Science Department
Andrew W. Moore, Dean, School of Computer Science

Return to: SCS Technical Report Collection
School of Computer Science

This page maintained by reports@cs.cmu.edu