TTL-MSR Taming Tail-Latency for Microsecond-scale RPCs

Jan, 2019 → Dec, 2021

Partner: Microsoft
Partner contact: Irene Zhang, Dan Ports, Marios Kogias
EPFL laboratory: Data Center Systems Laboratory (DCSL)
EPFL contact: Prof. Edouard Bugnion, Konstantinos Prasopoulos

The deployment of a web-scale application within a datacenter can comprise hundreds of software components, deployed on thousands of servers organized in multiple tiers and interconnected by commodity Ethernet switches. These components communicate with each other via Remote Procedure Calls (RPCs), with the cost of servicing an individual RPC typically measured in microseconds. The end-user performance, availability, and overall efficiency of the entire system depend largely on the efficient delivery and scheduling of these RPCs. Yet these RPCs are ubiquitously deployed today on top of general-purpose transport protocols such as TCP. We propose to make RPCs first-class citizens of datacenter deployment. This requires revisiting the overall architecture, application API, and network protocols. Our research direction is based on a novel RPC-oriented protocol, R2P2, which separates control flow from data flow and provides in-network scheduling opportunities to tame tail latency. We are also building the tools necessary to scientifically evaluate microsecond-scale services.

Topics: Digital Information