Readings in distributed systems
This post is a work in progress.
Inspired by a recent purchase of the Red Book, which provides a curated list of important papers on database systems, I’ve decided to begin assembling a list of important papers in distributed systems. As in the Red Book, I’ve grouped the papers into categories, each highlighting a progression of related ideas over time within a specific area of research in the field.
In keeping with the tradition of the Red Book, I’ve included both papers which resulted in very successful systems or techniques and papers which introduced a concept that was either immediately dismissed or proven incorrect. This emphasizes the progression of ideas which led to the development of these systems.
Consensus
The problems of establishing consensus in a distributed system.
- In Search of an Understandable Consensus Algorithm 2013
- A Simple Totally Ordered Broadcast Protocol 2008
- Paxos Made Live - An Engineering Perspective 2007
- The Chubby Lock Service for Loosely-Coupled Distributed Systems 2006
- Paxos Made Simple 2001
- Impossibility of Distributed Consensus with One Faulty Process 1985
- The Byzantine Generals Problem 1982
Consistency
Types of consistency, and practical solutions for ensuring atomic operations across a set of replicas.
- Highly Available Transactions: Virtues and Limitations 2013
- Consistency Tradeoffs in Modern Distributed Database System Design 2012
- CAP Twelve Years Later: How the “Rules” Have Changed 2012
- Calvin: Fast Distributed Transactions for Partitioned Database Systems 2012
- Optimistic Replication 2005
- Brewer’s Conjecture and the Feasibility of Consistent, Available, Partition-Tolerant Web Services 2002
- Harvest, Yield, and Scalable Tolerant Systems 1999
- Linearizability: A Correctness Condition for Concurrent Objects 1990
- Time, Clocks, and the Ordering of Events in a Distributed System 1978
Conflict-free data structures
Studies on data structures which do not require coordination to ensure convergence to the correct value.
- A Comprehensive Study of Convergent and Commutative Replicated Data Types 2011
- A Commutative Replicated Data Type For Cooperative Editing 2009
- CRDTs: Consistency Without Concurrency Control 2009
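To make the idea of convergence without coordination a little more concrete, here is a minimal sketch of a grow-only counter, one of the simplest state-based replicated data types. It is my own illustration rather than code taken from any of the papers above, and the names (`GCounter`, `replica_id`, `merge`) are assumptions chosen for readability.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class GCounter:
    """Grow-only counter: one non-decreasing slot per replica (illustrative sketch)."""

    replica_id: str  # hypothetical identifier for the local replica
    counts: Dict[str, int] = field(default_factory=dict)

    def increment(self, amount: int = 1) -> None:
        # Each replica only ever writes to its own slot.
        if amount < 0:
            raise ValueError("a grow-only counter cannot decrease")
        self.counts[self.replica_id] = self.counts.get(self.replica_id, 0) + amount

    def value(self) -> int:
        # The counter's value is the sum over all replica slots.
        return sum(self.counts.values())

    def merge(self, other: "GCounter") -> None:
        # Element-wise max is commutative, associative, and idempotent,
        # so replicas can exchange state in any order, any number of times.
        for rid, n in other.counts.items():
            self.counts[rid] = max(self.counts.get(rid, 0), n)


# Two replicas update independently, then exchange state.
a = GCounter("a")
b = GCounter("b")
a.increment(2)
b.increment(3)
a.merge(b)
b.merge(a)
a.merge(b)  # merging again changes nothing
```

After the exchange both replicas hold the same per-replica counts, even though neither waited on the other before incrementing.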
Distributed programming
Languages aimed at disorderly distributed programming, as well as case studies on problems in distributed programming.
- Logic and Lattices for Distributed Programming 2012
- Dedalus: Datalog in Time and Space 2011
- MapReduce: Simplified Data Processing on Large Clusters 2004
- A Note On Distributed Computing 1994
Systems
Implemented and theoretical distributed systems.
- Spanner: Google’s Globally-Distributed Database 2012
- ZooKeeper: Wait-free coordination for Internet-scale systems 2010
- A History Of The Virtual Synchrony Replication Model 2010
- Cassandra - A Decentralized Structured Storage System 2009
- Dynamo: Amazon’s Highly Available Key-Value Store 2007
- Stasis: Flexible Transactional Storage 2006
- Bigtable: A Distributed Storage System for Structured Data 2006
- The Google File System 2003
- Lessons from Giant-Scale Services 2001
- Towards Robust Distributed Systems 2000
- Cluster-Based Scalable Network Services 1997
- The Process Group Approach to Reliable Distributed Computing 1993
Books
Overviews and details covering many of the above papers and concepts compiled into single resources.
- Distributed Systems: for fun and profit 2013
- Programming Distributed Computing Systems: A Foundational Approach 2013
- Guide to Reliable Distributed Systems: Building High-Assurance Applications and Cloud-Hosted Services 2012
- Introduction to Reliable and Secure Distributed Programming 2011
I’m hoping to make this into a living document, so please submit pull requests or leave comments!