Improving Wide-Area Data Store Availability and Responsiveness with Crux

By Cristina Basescu, EPFL

Wide-area data stores enhance fault tolerance and response times by replicating their load on multiple machines; and they enhance scalability by distributing their load across machines. For users, they are an available and responsive black box. Yet network partitions can paralyze service interactions even between clients who are on the same side of the partition. Even in the absence of failures, these wide-area deployments can degrade the interaction latency between nearby clients significantly more than the network limits between them. We present Crux, the first general framework that addresses these issues by building locality-preserving data stores. Crux introduces Available Responsive Areas (ARAs) for transforming an existing scalable data-store system into a new locality-preserving one that has two properties. First, ARAs provide fine-grained boundaries to tolerate network partitions. Second, for any two clients interacting via Crux, ARAs bound their worst-case latency to a small multiple of their network latency. We apply Crux to two popular key-value stores, without any changes to their source code: Redis, an eventually consistent system, and CockroachDB, a strongly consistent system. For both, Crux shows in both cases significant latency and availability improvements for localized interactions. Our experiments indicate that Crux is effective, incurs a linear overhead, and is applicable to a variety of existing data stores.

Wednesday, June 19th, 2019 @16:15 room BC 410 (map)