The SNOW Theorem and Latency-Optimal Read-Only Transactions

Scalable storage systems where data is sharded across many machines are now the norm for Web services as their data has grown beyond what a single machine can handle. Consistently reading data across different shards requires transactional isolation for the reads. Yet a Web service may read from its data store hundreds or thousands of times for a single page load and must minimize read latency to keep response times low. Examining the read-only transaction algorithms for many recent academic and industrial scalable storage systems suggests there is a tradeoff between their power—expressed as the consistency they provide and their compatibility with other types of transactions—and their latency.

We show that this tradeoff is fundamental by proving the SNOW Theorem, an impossibility result that states that no read-only transaction algorithm can provide both the lowest latency and the highest power. We then use the tight boundary from the theorem to guide the design of new read-only transaction algorithms for two scalable storage systems, COPS and Rococo. We implement our new algorithms and then evaluate them to demonstrate they provide lower latency for read-only transactions and to understand their impact on overall throughput.
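
For intuition, here is a minimal sketch (our illustration, not an algorithm from the paper) of the latency-optimal side of the tradeoff: a read-only transaction that is non-blocking and completes in one round of parallel requests, with exactly one response per read. The SNOW Theorem shows such an algorithm cannot also provide the highest power, i.e., strict serializability alongside conflicting write transactions.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical client-side sketch of a latency-optimal read-only
# transaction: one parallel round of non-blocking reads, one version
# returned per key. `servers` maps each key to its shard (assumed API).
def read_only_txn(servers, keys):
    with ThreadPoolExecutor(max_workers=len(keys)) as pool:
        futures = {key: pool.submit(servers[key].read, key) for key in keys}
        return {key: f.result() for key, f in futures.items()}
```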

  • Paper

    From OSDI 2016

  • Slides (pdf version)

    From OSDI 2016

  • Audio (Haonan Lu)

    From OSDI 2016

  • COPS-SNOW Code

    On GitHub

  • Rococo-SNOW Code

    On GitHub

Consolidating Concurrency Control and Consensus for Commits under Conflicts with Janus

Conventional fault-tolerant distributed transactions layer a traditional concurrency control protocol on top of the Paxos consensus protocol. This approach provides scalability, availability, and strong consistency. When used for wide-area storage, however, this approach incurs cross-data-center coordination twice, in serial: once for concurrency control, and then once for consensus. In this paper, we make the key observation that the coordination required for concurrency control and consensus is highly similar. Specifically, each tries to ensure the serialization graph of transactions is acyclic. We exploit this insight in the design of Janus, a unified concurrency control and consensus protocol.
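
To make the shared goal concrete, here is a sketch of the invariant itself (not Janus's distributed protocol): with transactions as nodes and conflict dependencies as edges, the serialization graph must remain acyclic.

```python
# Sketch of the invariant both layers enforce: the serialization graph
# must be acyclic. `graph` maps a transaction id to the set of
# transaction ids it must be ordered after (a conflict dependency).
def has_cycle(graph):
    WHITE, GRAY, BLACK = 0, 1, 2
    color = dict.fromkeys(graph, WHITE)

    def visit(txn):
        color[txn] = GRAY                    # on the current DFS path
        for dep in graph.get(txn, ()):
            c = color.get(dep, WHITE)
            if c == GRAY:                    # back edge: a cycle
                return True
            if c == WHITE and visit(dep):
                return True
        color[txn] = BLACK                   # fully explored
        return False

    return any(color[t] == WHITE and visit(t) for t in graph)
```

For example, `has_cycle({"t1": {"t2"}, "t2": {"t1"}})` returns True, which is exactly the situation both concurrency control and consensus must prevent or resolve.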

Janus targets one-shot transactions written as stored procedures, a common but restricted class of transactions. Janus commits unconflicted transactions in this class in one round-trip and avoids aborts due to contention: it commits conflicted transactions in this class in at most two round-trips as long as the network is well behaved and a majority of each server's replicas are alive. We compare Janus with layered designs and TAPIR under a variety of workloads in this class. Our evaluation shows that Janus achieves 5× the throughput of a layered system and 90% of the throughput of TAPIR under a contention-free microbenchmark. Under contended workloads, Janus provides much lower latency and higher throughput (up to 6.8×) than the baselines.

  • Paper

    From OSDI 2016

  • Slides

    From OSDI 2016

  • Audio (Shuai Mu)

    From OSDI 2016

  • Code

    On GitHub

Existential Consistency: Measuring and Understanding Consistency at Facebook

Replicated storage for large Web services faces a trade-off between stronger forms of consistency and higher performance properties. Stronger consistency prevents anomalies, i.e., unexpected behavior visible to users, and reduces programming complexity. There is much recent work on improving the performance properties of systems with stronger consistency, yet the flip-side of this trade-off remains elusively hard to quantify. To the best of our knowledge, no prior work does so for a large, production Web service.

We use measurement and analysis of requests to Facebook's TAO system to quantify how often anomalies happen in practice, i.e., when results returned by eventually consistent TAO differ from what is allowed by stronger consistency models. For instance, our analysis shows that 0.0004% of reads to vertices would return different results in a linearizable system. This in turn gives insight into the benefits of stronger consistency; 0.0004% of reads are potential anomalies that a linearizable system would prevent. We directly study local consistency models—i.e., those we can analyze using requests to a sample of objects—and use the relationships between models to infer bounds on the others.

We also describe a practical consistency monitoring system that tracks φ-consistency, a new consistency metric ideally suited for health monitoring. In addition, we give insight into the increased programming complexity of weaker consistency by discussing bugs our monitoring uncovered, and anti-patterns we teach developers to avoid.
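
As a rough sketch of what a φ-consistency-style check could look like (the names and interfaces here are our assumptions, not TAO's API): read the same sampled object from every replica and record how often all replicas agree.

```python
# Hypothetical sketch of a phi-consistency-style health metric: the
# fraction of sampled reads for which every replica returns the same
# value. `replicas` and `read` are assumed interfaces for exposition.
def phi_consistency(replicas, sampled_keys):
    agreements = 0
    for key in sampled_keys:
        values = {replica.read(key) for replica in replicas}
        if len(values) == 1:            # all replicas agreed
            agreements += 1
    return agreements / len(sampled_keys)
```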

  • Paper

    From SOSP 2015

  • Slides

    From SOSP 2015

  • Video (Haonan Lu)

    From SOSP 2015

RIPQ: Advanced Photo Caching on Flash for Facebook

Facebook uses flash devices extensively in its photo-caching stack. The key design challenge for an efficient photo cache on flash at Facebook is its workload: many small random writes are generated by inserting cache-missed content, or updating cache-hit content for advanced caching algorithms. The Flash Translation Layer on flash devices performs poorly with such a workload, lowering throughput and decreasing device lifespan. Existing coping strategies under-utilize the space on flash devices, sacrificing cache capacity, or are limited to simple caching algorithms like FIFO, sacrificing hit ratios.

We overcome these limitations with the novel Restricted Insertion Priority Queue (RIPQ) framework that supports advanced caching algorithms with large cache sizes, high throughput, and long device lifespan. RIPQ aggregates small random writes, co-locates similarly prioritized content, and lazily moves updated content to further reduce device overhead. We show that two families of advanced caching algorithms, Segmented-LRU and Greedy-Dual-Size-Frequency, can be easily implemented with RIPQ. Our evaluation on Facebook’s photo trace shows that these algorithms running on RIPQ increase hit ratios up to ~20% over the current FIFO system, incur low overhead, and achieve high throughput.
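
To make the idea concrete, here is a heavily simplified in-memory sketch (ours, not the production design, which manages flash blocks): a restricted insertion priority queue exposes K insertion points, inserts land at the point matching an item's priority so that similarly prioritized content is co-located, and eviction happens from the lowest-priority tail.

```python
from collections import deque

# Simplified sketch of a restricted insertion priority queue with K
# insertion points; an advanced caching policy maps each item's
# priority in [0, 1) to one of them. Names here are our choices.
class RestrictedInsertionQueue:
    def __init__(self, k=8):
        self.k = k
        self.sections = [deque() for _ in range(k)]  # 0 = lowest priority

    def insert(self, item, priority):
        index = min(int(priority * self.k), self.k - 1)
        self.sections[index].appendleft(item)        # head of its section

    def evict(self):
        for section in self.sections:                # lowest priority first
            if section:
                return section.pop()                 # tail of that section
        return None
```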

  • Paper

    From FAST 2015

  • Slides

    From FAST 2015

  • Video (Linpeng Tang)

    From FAST 2015

  • Theoretical Analysis

    From Jan 2015

  • FB Research Blog Post

    From Apr 2016

Extracting More Concurrency from Distributed Transactions with Rococo

Distributed storage systems run transactions across machines to ensure serializability. Traditional protocols for distributed transactions are based on two-phase locking (2PL) or optimistic concurrency control (OCC). 2PL serializes transactions as soon as they conflict and OCC resorts to aborts, leaving many opportunities for concurrency on the table. This paper presents Rococo, a novel concurrency control protocol for distributed transactions that outperforms 2PL and OCC by allowing more concurrency. Rococo executes a transaction as a collection of atomic pieces, each of which commonly involves only a single server. Servers first track dependencies between concurrent transactions without actually executing them. At commit time, a transaction’s dependency information is sent to all servers so they can re-order conflicting pieces and execute them in a serializable order.
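
A toy sketch of that commit-time reordering step (using networkx purely for illustration; the real protocol distributes this logic across servers): collapse dependency cycles into strongly connected components, topologically sort the components, and order transactions within a component deterministically.

```python
import networkx as nx  # illustration only; Rococo is not built on networkx

# Toy sketch of commit-time reordering: given conflict edges (a, b)
# meaning "a must execute before b", produce one serializable order by
# condensing cycles into components and breaking ties deterministically.
def execution_order(conflict_edges):
    graph = nx.DiGraph(conflict_edges)
    condensed = nx.condensation(graph)               # SCCs become nodes
    order = []
    for component in nx.topological_sort(condensed):
        order.extend(sorted(condensed.nodes[component]["members"]))
    return order
```

For instance, `execution_order([("t1", "t2"), ("t2", "t1"), ("t2", "t3")])` yields `["t1", "t2", "t3"]`: the cycle between t1 and t2 is resolved deterministically rather than by aborting.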

We compare Rococo to OCC and 2PL using a scaled TPC-C benchmark. Rococo outperforms 2PL and OCC in workloads with varying degrees of contention. When contention is high, Rococo's throughput is 130% and 347% higher than that of 2PL and OCC, respectively.

  • Paper

    From OSDI 2014

  • Slides

    From OSDI 2014

  • Video (Shuai Mu)

    From OSDI 2014

  • Code

    On GitHub

f4: Facebook’s Warm BLOB Storage System

Facebook’s corpus of photos, videos, and other Binary Large OBjects (BLOBs) that need to be reliably stored and quickly accessible is massive and continues to grow. As the footprint of BLOBs increases, storing them in our traditional storage system, Haystack, is becoming increasingly inefficient. To increase our storage efficiency, measured in the effective-replication-factor of BLOBs, we examine the underlying access patterns of BLOBs and identify temperature zones that include hot BLOBs that are accessed frequently and warm BLOBs that are accessed far less often. Our overall BLOB storage system is designed to isolate warm BLOBs and enable us to use a specialized warm BLOB storage system, f4. f4 is a new system that lowers the effective-replication-factor of warm BLOBs while remaining fault tolerant and able to support the lower throughput demands.

f4 currently stores over 65 PB of logical BLOBs and reduces their effective-replication-factor from 3.6 to either 2.8 or 2.1. f4 provides low latency; is resilient to disk, host, rack, and datacenter failures; and provides sufficient throughput for warm BLOBs.
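
The arithmetic behind those effective-replication-factors, as we read the paper: Reed-Solomon(10, 4) coding stores 14 blocks per 10 logical blocks (1.4×); keeping that encoded copy in two datacenters gives 2.8×, while XOR-pairing block volumes across datacenters (3 physical volumes for every 2 logical ones) gives 1.4 × 1.5 = 2.1×.

```python
# Worked arithmetic for f4's headline numbers (our reading of the paper).
rs_overhead = (10 + 4) / 10         # Reed-Solomon(10, 4): 1.4x per copy

two_full_copies = 2 * rs_overhead   # encoded copy in two datacenters: 2.8x
xor_across_dcs = 1.5 * rs_overhead  # XOR pairing across datacenters: 2.1x

print(f"{two_full_copies:.1f}, {xor_across_dcs:.1f}")  # 2.8, 2.1
```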

  • Paper

    From OSDI 2014

  • Slides

    From OSDI 2014

  • Video (Sabyasachi Roy)

    From OSDI 2014

An Analysis of Facebook Photo Caching

This paper examines the workload of Facebook's photo-serving stack and the effectiveness of the many layers of caching it employs. Facebook’s image-management infrastructure is complex and geographically distributed. It includes browser caches on end-user systems, Edge Caches at ~20 PoPs, an Origin Cache, and for some kinds of images, additional caching via Akamai. The underlying image storage layer is widely distributed, and includes multiple data centers.

We instrumented every Facebook-controlled layer of the stack and sampled the resulting event stream to obtain traces covering over 77 million requests for more than 1 million unique photos. This permits us to study traffic patterns, cache access patterns, geolocation of clients and servers, and to explore correlation between properties of the content and accesses. Our results (1) quantify the overall traffic percentages served by different layers: 65.5% browser cache, 20.0% Edge Cache, 4.6% Origin Cache, and 9.9% Backend storage, (2) reveal that a significant portion of photo requests are routed to remote PoPs and data centers as a consequence both of load-balancing and peering policy, (3) demonstrate the potential performance benefits of coordinating Edge Caches and adopting S4LRU eviction algorithms at both Edge and Origin layers, and (4) show that the popularity of photos is highly dependent on content age and conditionally dependent on the social-networking metrics we considered.
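
Since S4LRU is central to finding (3), here is a small sketch of it as the paper describes it (capacity handling and method names are our choices): four stacked LRU segments, where a missed item enters the lowest segment, a hit promotes the item one segment up, and overflow cascades downward until items fall out of the lowest segment.

```python
from collections import OrderedDict

# Sketch of S4LRU: four LRU segments, promotion on hit, cascading
# demotion on overflow. `per_segment_capacity` is our simplification.
class S4LRU:
    def __init__(self, per_segment_capacity):
        self.cap = per_segment_capacity
        self.segments = [OrderedDict() for _ in range(4)]  # 0 = lowest

    def _insert(self, level, key, value):
        while level >= 0:
            segment = self.segments[level]
            segment[key] = value
            segment.move_to_end(key, last=False)   # head of this segment
            if len(segment) <= self.cap:
                return
            key, value = segment.popitem()         # overflow at the tail
            level -= 1                             # cascade down; below 0 evicts

    def access(self, key, value):
        for level, segment in enumerate(self.segments):
            if key in segment:                     # hit: promote one segment up
                del segment[key]
                self._insert(min(level + 1, 3), key, value)
                return True
        self._insert(0, key, value)                # miss: lowest segment
        return False
```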

  • Paper

    From SOSP 2013

  • Slides

    From SOSP 2013

  • Video (Qi Huang)

    From SOSP 2013

  • FB Eng Blog Post

    From Feb 2014

Stronger Semantics for Low-Latency Geo-Replicated Storage with Eiger

We present the first scalable, geo-replicated storage system that guarantees low latency, offers a rich data model, and provides "stronger" semantics. Namely, all client requests are satisfied in the local datacenter in which they arise; the system efficiently supports useful data model abstractions such as column families and counter columns; and clients can access data in a causally-consistent fashion with read-only and write-only transactional support, even for keys spread across many servers.

The primary contributions of this work are enabling scalable causal consistency for the complex column-family data model, as well as novel, non-blocking algorithms for both read-only and write-only transactions. Our evaluation shows that our system, Eiger, achieves low latency (single-ms), has throughput competitive with eventually-consistent and non-transactional Cassandra (less than 7% overhead for one of Facebook's real-world workloads), and scales out almost linearly (averaging a 96% throughput increase with each doubling of cluster size, up to 128 servers).

  • Paper

    From NSDI 2013

  • Slides

    From NSDI 2013

  • Video

    From NSDI 2013

  • Code

    On GitHub

Don’t Settle for Eventual: Causal Consistency for Wide-Area Storage with COPS

Geo-replicated, distributed data stores that support complex online applications, such as social networks, must provide an "always-on" experience where operations always complete with low latency. Today's systems often sacrifice strong consistency to achieve these goals, exposing inconsistencies to their clients and necessitating complex application logic. In this paper, we identify and define a consistency model--causal consistency with convergent conflict handling, or causal+--that is the strongest achieved under these constraints.

We present the design and implementation of COPS, a key-value store that delivers this consistency model across the wide area. A key contribution of COPS is its scalability: it can enforce causal dependencies between keys stored across an entire cluster, rather than on a single server as in previous systems. The central approach in COPS is tracking and explicitly checking whether causal dependencies between keys are satisfied in the local cluster before exposing writes. Further, in COPS-GT, we introduce get transactions in order to obtain a consistent view of multiple keys without locking or blocking. Our evaluation shows that COPS completes operations in less than a millisecond, provides throughput similar to previous systems when using one server per cluster, and scales well as we increase the number of servers in each cluster. It also shows that COPS-GT provides similar latency, throughput, and scaling to COPS for common workloads.
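
A minimal sketch of that dependency check, with assumed names rather than COPS's actual API: a replicated write is committed locally only after each of its causal dependencies is already visible in the local cluster.

```python
# Hypothetical sketch of COPS-style dependency checking on the replica
# side; `wait_until_committed` and `commit` are assumed interfaces.
def apply_replicated_write(store, key, value, version, deps):
    # deps: (key, version) pairs this write causally depends on
    for dep_key, dep_version in deps:
        store.wait_until_committed(dep_key, dep_version)
    store.commit(key, value, version)  # now safe to expose to local readers
```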

  • Paper

    From SOSP 2011

  • Slides

    From SOSP 2011

  • Video

    From SOSP 2011

Low-Overhead Transparent Recovery for Static Content with TRODS

Client connections to web services break when the particular server they are connected to fails or is taken down for maintenance. We designed and built TRODS, a system that transparently recovers connections to web services that deliver static content, e.g., photos or videos. TRODS is implemented as a server-side kernel module for immediate deployability; it works with unmodified services and clients. The key insight in TRODS is its use of cross-layer visibility and control: it derives reliable storage for application-level state from the mechanics of the transport layer. In contrast with more general recovery techniques, the overhead of TRODS is minimal. It provides throughput-per-server competitive with unmodified HTTP services, enabling recovery without additional capital expenditures.

  • Paper

    From DSN 2011

  • Slides

    From DSN 2011

Using History for High-Throughput Fault Tolerance with Prophecy

Byzantine fault-tolerant (BFT) replication provides protection against arbitrary and malicious faults, but its performance does not scale with cluster size. We designed and built Prophecy, a system that interposes itself between clients and any replicated service to scale throughput for read-mostly workloads. Prophecy relaxes consistency to delay-once linearizability so it can perform fast, load-balanced reads when results are historically consistent, and slow, replicated reads otherwise. This dramatically increases the throughput of replicated services, e.g., the throughput of a 4-node Prophecy web service is ~4× that of a 4-node PBFT web service.
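
A sketch of the fast-read idea as we read it (the proxy, sketch table, and replica interfaces are illustrative assumptions): the interposed proxy keeps a compact sketch, such as a hash, of each request's last agreed-upon reply; a fast read queries a single replica and is returned only if it matches that historical sketch, otherwise the proxy falls back to a slow replicated read.

```python
import hashlib

# Illustrative sketch of a historically consistent fast read; names
# and interfaces here are assumptions for exposition, not Prophecy's API.
def read(proxy, request):
    reply = proxy.any_replica().execute(request)       # fast path: one replica
    if proxy.sketch_table.get(request) == hashlib.sha256(reply).digest():
        return reply                                   # historically consistent
    reply = proxy.replicated_read(request)             # slow path: all replicas
    proxy.sketch_table[request] = hashlib.sha256(reply).digest()
    return reply
```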

  • Paper

    From NSDI 2010

IP Address Passing for VANETs

In Vehicular Ad-hoc Networks (VANETs), vehicles have short connection times when moving past wireless access points. The time required for acquiring IP addresses via DHCP consumes a significant portion of each connection. We reduce the connection time to under a tenth of a second by passing IP addresses between vehicles. Our implementation improves efficiency, reduces latency, and increases vehicle connectivity without modifying either DHCP or AP software.

  • Paper

    From PerCom 2008