This post originally appeared on the Macrometa blog.
When deploying web and mobile applications today, developers typically start with an initial deployment of their application in a single data center. The primary state for the application is kept in a highly available database, and clients located around the world issue read and write requests against this database through an application server located in the same data center. In this model, not all clients get the same quality of service from the database: clients that are geographically distant from the data center pay a much higher latency cost, leading to slower application performance and a degraded user experience. Therefore, in order to provide users located all over the world with a rich, native application experience, developers typically have to resort to geo-replication.
Geo-replication serves two purposes: fault tolerance and low latency. First, it allows applications to survive network and data center outages by routing users to other data centers in the event of failure. Second, it places replicas all over the globe, allowing each user to communicate with the data center that is geographically closest and to interact with the application at lower latency. While these two properties together help provide users with a fluid, always-on experience, they raise a number of concerns about how replicas remain consistent when things inevitably go wrong and failures occur.
When a failure prevents one or more of the data centers from communicating with the others, the system is forced to make a choice when it receives a request from the application. Either the system can refuse the request (or block it for an arbitrarily long time), or it can allow the request to proceed, accepting a potential inconsistency between the data centers and the possibility of data loss under subsequent failures. Systems making the former design choice are typically referred to as Consistent (CP) systems, whereas systems making the latter choice are referred to as Available (AP) systems.
Consistent systems aim to provide users with a “single system image”: the distributed system is given the appearance of a single machine, and the failures of individual components are invisible to the end user. These systems have a gold standard for correctness known as linearizability (or, in transactional systems, strict serializability), which roughly states that a read of an object reflects the result of the most recent write to that object, in a manner compatible with the real-time order of events. These guarantees must hold even in the presence of a bounded number of faults. To achieve them, the system typically routes operations through an elected coordinator that is responsible for ordering them, and ensures that all (or a majority) of the data centers acknowledge each write before responding to the application.
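To make the coordination cost concrete, here is a minimal, in-process sketch of a majority-acknowledged write. It is illustrative only: the `Replica` and `Coordinator` names are hypothetical, and a real system would run a consensus protocol such as Raft or Paxos over the network rather than simple method calls.

```python
# Minimal sketch of majority-acknowledged writes, as a coordinated
# (CP-style) system might perform them. Names are illustrative only.

class Replica:
    def __init__(self, name):
        self.name = name
        self.log = []          # ordered sequence of accepted writes

    def accept(self, seq, key, value):
        self.log.append((seq, key, value))
        return True            # acknowledge the write

class Coordinator:
    def __init__(self, replicas):
        self.replicas = replicas
        self.seq = 0           # the coordinator assigns a total order

    def write(self, key, value):
        self.seq += 1
        acks = sum(1 for r in self.replicas
                   if r.accept(self.seq, key, value))
        majority = len(self.replicas) // 2 + 1
        if acks >= majority:
            return "ok"        # a majority holds the ordered write
        raise RuntimeError("write rejected: could not reach a majority")

replicas = [Replica(n) for n in ("us-east", "eu-west", "ap-south")]
print(Coordinator(replicas).write("cart:42", {"item": "book"}))  # ok
```

The key point is the round trip: the application does not get a response until a majority of (possibly distant) data centers have acknowledged the ordered write.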
Available systems sit at the opposite end of the spectrum. Acknowledging that failures are unavoidable, available systems typically allow the application to proceed as soon as the operation is acknowledged at one (or more) of the data centers. However, these systems place a burden on the application developer for a number of reasons. First, if an application has to fail over to another data center, an object that was just written may no longer be readable; in fact, without a guarantee of ordering, any number of previous writes may no longer be visible during failover. Second, without involving a majority of the data centers or using a single data center to order reads and writes, conflicting writes to the same object can occur at different data centers concurrently. These writes do not even need to happen in the same second: any two writes issued in the window since the two data centers last communicated can induce a conflict that must be manually resolved. Third, if a write is acknowledged at only a single data center, failure of that data center before the value is transmitted to the others may result in data loss.
In edge systems, these issues are unavoidable. Edge systems are designed specifically to reduce the user-perceived, end-to-end latency of applications. This requires that applications, and the data needed to run them, are replicated as close to the user as possible. Given the goal of low latency for all users, these replicas must be numerous: an order of magnitude more than in traditional data center deployments. These additional networks, data centers, and replicas only increase the probability of failure, exacerbating the aforementioned issues in modern stateful edge deployments.
Since edge deployments are placed close to the user in order to reduce user-perceived latency for read and write operations, they must be able to accept writes locally without blocking. If these nodes wait to coordinate with a primary data center, whether through consensus or some other mechanism, users will observe the cost of a coordinated write: the very round trip they were trying to avoid by placing a replica close to the user. Therefore, conflict resolution is unavoidable.
One mechanism for automatic conflict resolution is Conflict-Free Replicated Data Types (CRDTs). CRDTs are abstract data types that codify deterministic merge behavior into the data structure itself. There are two main flavors of CRDT, each designed for use in a different environment. For environments where causal ordering can be provided, there are operation-based CRDTs. For environments where no ordering can be guaranteed at all, there are state-based CRDTs.
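As a first taste of what “merge behavior codified into the data structure” means, here is a minimal sketch (illustrative Python, not Macrometa's implementation) of about the simplest CRDT there is, a grow-only set, whose merge function is simply set union:

```python
# A grow-only set (G-Set): the simplest CRDT. The only operations are
# add and merge; merge is set union, so two replicas can be combined
# in any order and always reach the same state.

class GSet:
    def __init__(self):
        self.items = set()

    def add(self, item):
        self.items.add(item)

    def merge(self, other):
        merged = GSet()
        merged.items = self.items | other.items   # union is the merge
        return merged

a, b = GSet(), GSet()
a.add("x")
b.add("y")
assert a.merge(b).items == b.merge(a).items == {"x", "y"}
```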
When no ordering can be guaranteed, state-based CRDTs are ideal. State-based CRDTs are built on a mathematical structure called a semilattice. Semilattices, in particular bounded join-semilattices, have the property that any two divergent (that is, concurrent) states always have a well-defined join. This maps nicely onto distributed systems: any two divergent replicas can always be merged, without coordination, to the same result. To guarantee this merge behavior under any delivery order, these data structures typically store metadata internally that summarizes, for each change, the changes already observed, so that an update that is duplicated or delivered out of order by the network can be identified. The data structure must also encode enough information to detect divergent state. There has been significant work in the CRDT literature on reducing the metadata required by state-based CRDTs, with some designs reducing it to a constant overhead, or to overhead linear in the number of elements for collection types.
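To see the semilattice structure and the per-replica metadata in practice, here is an illustrative sketch of a state-based grow-only counter. Its state is a map from replica id to that replica's local count, and its merge is an element-wise maximum, which is associative, commutative, and idempotent, so duplicated or reordered merges cannot corrupt the result.

```python
# State-based grow-only counter (G-Counter). The state is a map from
# replica id to that replica's local count; this per-replica metadata
# is what lets merge tolerate duplicated and reordered deliveries.

class GCounter:
    def __init__(self, replica_id):
        self.replica_id = replica_id
        self.counts = {}                   # replica id -> local count

    def increment(self, n=1):
        self.counts[self.replica_id] = self.counts.get(self.replica_id, 0) + n

    def value(self):
        return sum(self.counts.values())

    def merge(self, other):
        # The join in the semilattice: element-wise max of the two maps.
        for rid, count in other.counts.items():
            self.counts[rid] = max(self.counts.get(rid, 0), count)

a, b = GCounter("dc-east"), GCounter("dc-west")
a.increment(3)
b.increment(2)
a.merge(b)
a.merge(b)          # merging twice (a duplicated delivery) is harmless
assert a.value() == 5
```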
When causal ordering can be provided, we can do better. Causal ordering ensures that updates are delivered with respect to the order in which they were performed, with no real-time bound on visibility. Think of it this way: each node's updates are delivered to its peers in FIFO (first-in, first-out) order, and the transitive closure of those updates is also delivered in order. If node A generated update 2 after observing update 1 from node B, then node C will observe update 1 before update 2. Some of the information a state-based CRDT must store about message ordering is therefore stored in the network instead. It's important to realize that the system still needs this information somewhere; the question is where it is stored and enforced. Operation-based CRDTs exploit this to reduce the overhead of object storage. Instead of modeling the abstract data type as a lattice, we model changes to the object's state as operations. These operations only need to guarantee one property, commutativity, which ensures that updates occurring at the same logical time arrive at the same result no matter how they are permuted.
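Here is a minimal sketch of the operation-based style, assuming the network delivers every operation exactly once and in causal order (which this toy code does not itself enforce). The replicated state is just a number, and the shipped operations (increments) commute, so replicas that apply them in different orders still converge:

```python
# Operation-based counter: replicas ship operations ("add n"), not
# full state. Increments commute, so any delivery order compatible
# with causality (here the operations are all independent) converges.

class OpCounter:
    def __init__(self):
        self.value = 0

    def apply(self, op):
        self.value += op              # op is just the increment amount

ops_from_a = [1, 4]                   # operations generated at replica A
ops_from_b = [2]                      # operations generated at replica B

replica_1, replica_2 = OpCounter(), OpCounter()
for op in ops_from_a + ops_from_b:
    replica_1.apply(op)
for op in ops_from_b + ops_from_a:    # a different interleaving
    replica_2.apply(op)
assert replica_1.value == replica_2.value == 7
```

Note how little metadata the object itself carries compared to the state-based counter above: the ordering burden has moved into the delivery layer.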
However, designing data types remains a challenge. Consider a set that allows concurrent additions and removals of items: it is rather straightforward to map many invariants of the sequential specification to the concurrent, distributed setting. Concurrent adds of the same element commute, and concurrent removals of the same element commute. But not all invariants map so easily. For example, what happens when an element is concurrently added at one replica and removed at another? For the system to remain deterministically convergent without coordination, the designers of these data types must decide up front how such operations are arbitrated. This has led to many variations of these data types: one variant for each pair of concurrent operations that does not commute. In the case of concurrent addition and removal on a set, there are two variants: one that biases towards additions and ignores the removal, and one that biases towards removals and ignores the addition.
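As an illustration of the add-wins choice, here is a deliberately simplified observed-remove set (not a production implementation, and not any particular system's API): each add is tagged with a unique id, and a remove only tombstones the tags it has actually observed, so an add that was concurrent with the remove survives the merge.

```python
# Simplified add-wins (observed-remove) set. Each add carries a unique
# tag; remove tombstones only the tags it has observed, so a concurrent
# add (with an unseen tag) survives the merge -- the add "wins".

import uuid

class AWSet:
    def __init__(self):
        self.adds = set()       # (element, tag) pairs
        self.removes = set()    # tags that have been removed

    def add(self, element):
        self.adds.add((element, uuid.uuid4().hex))

    def remove(self, element):
        # Tombstone only the tags this replica has observed so far.
        self.removes |= {tag for (e, tag) in self.adds if e == element}

    def merge(self, other):
        self.adds |= other.adds
        self.removes |= other.removes

    def contains(self, element):
        return any(e == element and tag not in self.removes
                   for (e, tag) in self.adds)

a = AWSet()
a.add("milk")
b = AWSet()
b.merge(a)                      # b has observed a's add
b.remove("milk")                # concurrent with...
a.add("milk")                   # ...a re-adding it under a fresh tag
a.merge(b)
b.merge(a)
assert a.contains("milk") and b.contains("milk")   # the add wins
```

A remove-wins variant would make the opposite choice, for example by tombstoning the element itself rather than individual observed tags.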
The challenges do not stop there. One of the most primitive and valuable data types, the register, poses similar challenges. Registers do not have the benefit of different types of operations by which we can arbitrate (in the previous example we had both add and remove operations on a set); they have a single operation, set, which modifies the value in place. Therefore, every concurrent operation in the system conflicts, and we must decide ahead of time how to resolve these conflicting updates. Systems that implement distributed registers take several approaches. First, we could keep both values; however, this is problematic if an application expects to read a single value and is instead handed a set of values. Second, we could associate a timestamp with each write and have the system pick the most recent update; however, clocks are known to drift, and the system may pick the wrong value (furthermore, a clock drifting into the future can prevent any writes at the current time, a phenomenon known in Apache Cassandra as “doomstones”). Finally, another possible option (and not the only remaining option) is to arbitrarily pick one of the updates; to remain correct, this must be done in a manner that guarantees all nodes pick the exact same update, without requiring coordination in the critical path.
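In practice, the second and third options are often combined in a last-writer-wins register: each write carries a timestamp, and ties are broken deterministically by a replica identifier, so every node picks the same winner without coordinating. The sketch below is illustrative only, and it inherits the clock-drift risks described above.

```python
# Last-writer-wins (LWW) register. Each write carries (timestamp,
# replica_id, value); merge keeps the largest (timestamp, replica_id)
# pair, so all replicas deterministically converge on a single value.

import time

class LWWRegister:
    def __init__(self, replica_id):
        self.replica_id = replica_id
        self.entry = (0.0, "", None)   # (timestamp, replica_id, value)

    def set(self, value):
        self.entry = (time.time(), self.replica_id, value)

    def merge(self, other):
        # Timestamp first, replica id as a deterministic tie-breaker.
        self.entry = max(self.entry, other.entry,
                         key=lambda e: (e[0], e[1]))

    def get(self):
        return self.entry[2]

a, b = LWWRegister("dc-east"), LWWRegister("dc-west")
a.set("draft")
b.set("final")
a.merge(b)
b.merge(a)
assert a.get() == b.get()       # both replicas pick the same winner
```

The first option (keep both values) corresponds to a multi-value register, which pushes the final resolution back onto the application.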
An open question remains: is there a way to resolve these conflicts that is practical and easy for developers to implement, while acknowledging that no single mechanism suits every possible case and every possible conflict resolution strategy? That will be the subject of our next post: how the Macrometa convergence engine works.