This post originally appeared on the Macrometa blog.
In our last post, we discussed the challenges of conflict resolution when supporting a highly available distributed database that operates at the edge. Let’s recap the challenges.
Moving applications to the edge is done primarily for two reasons: latency and availability. First, by moving application code and associated application data to the edge, user requests can be processed entirely at the edge and responses returned to the user more quickly, reducing user-perceived latency. Second, in the event of a data center or connectivity failure, user requests can be rerouted to another geographically close Point-of-Presence, preventing periods of application inaccessibility and allowing the user to continue using the application without an interruption of service. Further, if the user’s geographically closest Point-of-Presence is unable to contact other data centers because of a connection failure, users can operate on data local to that location without experiencing an interruption of service.
Enabling this type of user experience ultimately means that conflicts at the data store are inevitable. Without waiting for agreement from all of the data centers in the system before modifying an object – the key to remaining available under network partitions caused by data center failures and connectivity loss – concurrent modifications to the same object in the data store may conflict. These conflicts need not occur simultaneously: they only need to fall within a time window in which no ordering relationship between the two modifications can be determined.
To provide a practical example: suppose Alice and Bob both modify a document in their address book containing the current address of a relative. Even if the values written are the same, or minor variations using abbreviations, the system will detect a conflict and be unable to proceed. Generally speaking, the system then has to make one of two choices: either keep both values and, upon the next read, return multiple values to a user who may have been expecting a single one; or pick one of the updates as the “winning” update, using some arbitration procedure to decide which modification “overwrites” the other. Some systems (e.g., Riak, Bayou, Dynamo) return both values to the user and allow the user to replace the multiple values with a single value – a form of application-specific conflict resolution – whereas other systems (e.g., Cassandra) choose a single value based on a user-provided timestamp. In the formal literature on the topic, the former is known as a Multi-Value Register and the latter as a Last-Writer-Wins Register. However, since applications work with richer data types than registers – they use dictionaries, sets, and counters – keeping both values or arbitrarily choosing one might not provide the correct application behavior for the user. To provide a pathological example, if a counter is modeled as a register, every conflicting set of writes loses a single increment or decrement of the counter – this may be very bad for the application!
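The counter example can be made concrete with a minimal sketch (not C8DB’s implementation; all names here are illustrative). A counter stored as a last-writer-wins register silently drops one of two concurrent increments, whereas a grow-only counter (G-Counter CRDT) preserves both:

```python
def lww_merge(a, b):
    # Each write is a (timestamp, value) pair; the higher timestamp wins.
    return a if a >= b else b

# Both replicas read the value 10 at t=1, then increment concurrently at t=2.
replica_a = (2, 11)  # 10 + 1
replica_b = (2, 11)  # 10 + 1, but a *different* increment
merged = lww_merge(replica_a, replica_b)
# merged is (2, 11): one of the two increments has been lost.

def g_counter_merge(x, y):
    # A G-Counter keeps one slot per replica; merging takes the per-slot
    # maximum, and the counter's value is the sum of all slots.
    return {r: max(x.get(r, 0), y.get(r, 0)) for r in x.keys() | y.keys()}

a = {"A": 6, "B": 5}  # replica A incremented its own slot (total 11)
b = {"A": 5, "B": 6}  # replica B incremented its own slot (total 11)
merged_counter = g_counter_merge(a, b)
# sum(merged_counter.values()) is 12: both increments survive the merge.
```

The G-Counter works because each replica only ever writes to its own slot, so a per-slot maximum never discards an increment.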
Coming up with interesting ways to converge conflicting operations across different data types has a whole field of study and body of work behind it called Conflict-Free Replicated Data Types (CRDTs). We touched on the challenges of designing CRDTs in our previous post. In this post, we discuss how the Real-Time Intelligent Convergence Engine – at the heart of Macrometa’s C8DB offering – handles conflict resolution.
Given that Macrometa’s platform must face the challenges of convergence and conflict resolution head-on to remain available at the edge when network partitions inevitably occur, it was paramount to establish a data model that was both usable from the application point-of-view and contained conflict resolution logic that was transparent to the user.
In C8DB, application users store unstructured documents – similar to JSON objects, or documents in a document-oriented database such as CouchDB. Documents are values that are associated with keys, and applications can either access a document by key – using a point query – or use a SQL-inspired query language that allows for aggregations and selections across documents located in the system. Documents can be arbitrarily nested and therefore are recursive data structures containing keys that map to values (which, themselves can be dictionaries mapping further keys to values.) To perform modifications to a document, users use the query language to perform INSERT or UPDATE operations against the data store – these operations can modify either a single document or a collection of documents.
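As a rough illustration of this document model (hypothetical data and key names, not a C8DB API), a document is a nested key/value structure that can be fetched by key and modified field by field:

```python
# A toy in-memory store standing in for the document model: each key maps
# to a nested, dictionary-like document (illustrative only).
store = {
    "contacts/uncle-joe": {
        "name": "Joe",
        "address": {"street": "1 Pine St", "city": "Seattle", "state": "WA"},
    }
}

doc = store["contacts/uncle-joe"]        # a point query: access by key
doc["address"]["street"] = "2 Elm St"    # an update to one nested field
```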
Internally, these operations are totally ordered at each data center. Therefore, the conflicts the application developer needs to reason about are conflicting operations that happen concurrently at different data centers: operations where the system can determine neither which update preceded the other nor any precedence between the different types of operations. In our earlier example, if Alice and Bob were to update the street address of the same record concurrently at different data centers, we have to make a choice. In lieu of using user-provided timestamps to arbitrarily pick a write – which has been known to result in serious consistency problems in Cassandra when clock skew occurs – RICE instead chooses one of the conflicting updates arbitrarily based on the data center’s unique identifier. While this may not capture user intent, it ensures convergence and prevents a clock that happens to skew forward in time from rendering the system unwritable. This arbitration is performed per field in the document, and only for update-update conflicts where the system cannot determine an answer that reflects the true user intent.
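A minimal sketch of this style of arbitration (illustrative Python; the tuple encoding and the rule that the higher DC identifier wins are assumptions, not RICE’s actual scheme):

```python
def merge_field(update_a, update_b):
    # Each concurrent update to a field is a (dc_id, value) pair; the data
    # center identifier breaks the tie deterministically, so every data
    # center converges on the same winner without consulting any clock.
    return update_a if update_a[0] > update_b[0] else update_b

alice = ("DC2", "100 Main St, Oakland, CA")
bob = ("DC1", "100 Main Street, Oakland, California")
winner = merge_field(alice, bob)
# The rule is order-independent: merge_field(bob, alice) picks the same
# winner, which is exactly what convergence requires.
```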
When more information is available, the system can do better. If we consider the case where we have a document containing multiple keys, and concurrently Alice removes a field while Bob updates that same field to a new value, we can order Bob’s update after Alice’s, under the assumption that Alice’s removal was performed against the current value of the field – therefore, Bob’s update should survive and not be discarded by Alice’s modification. Again, we can perform this arbitration on a per-field basis and try to preserve user intent. While this might seem reasonable, it can also lead to unintended consequences.
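The remove-versus-update rule can be sketched as follows (illustrative Python, not C8DB’s implementation; the operation encoding and the DC-id tie-break are assumptions):

```python
def merge_field_ops(op_a, dc_a, op_b, dc_b):
    # Each op is ("remove", None) or ("update", value). A concurrent update
    # beats a remove, on the assumption the removal targeted the old value;
    # otherwise we fall back to the deterministic DC-id tie-break.
    if op_a[0] == "update" and op_b[0] == "remove":
        return op_a
    if op_a[0] == "remove" and op_b[0] == "update":
        return op_b
    return op_a if dc_a > dc_b else op_b

alice = ("remove", None)         # Alice deletes the field...
bob = ("update", "+1-555-0100")  # ...while Bob concurrently updates it
result = merge_field_ops(alice, "DC2", bob, "DC1")
# Bob's update survives Alice's concurrent removal.
```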
Consider the case of an address book entry modeled as a document containing a street address, a state, and a boolean flag identifying whether or not the user lives in Washington state. Alice updates the document to change the address from Washington to California and untoggles the boolean; simultaneously, Bob updates the address to a different address in Washington state. Alice is located in DC2, Bob in DC1. In this case, we run into trouble no matter which method we use to converge.
Let’s examine a few of the many cases:
Using data center identifiers and per-field convergence, if we send only the modified fields to the system for convergence, we end up with a checked boolean and an address in California. This is clearly wrong.

If we model our boolean as a register, we will not converge correctly unless Bob sends an update for the boolean as well.

If we model our boolean with semantics where booleans always merge from a false value to a true value, and model Bob’s change as an update of both the state (to its current value) and the new address, this merges correctly – but only because Alice’s updates originated in DC2. Had Alice been located in DC1, it would not have; because Bob is located in DC1, we merge correctly.

The difficulty here is that the value of the boolean depends on the value of the address: this is an invariant that exists within the document itself, and converging documents based only on the fields in the document is not sufficient to preserve that invariant. Therefore, we need another mechanism to ensure that these invariants hold under conflict resolution.
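The core problem can be sketched in a few lines (illustrative Python with an assumed lower-DC-wins tie-break; not RICE’s implementation): merging field by field over only the modified fields can combine fields from both writers into a document neither of them wrote – here, a Washington street paired with a California state.

```python
def merge_docs(updates_a, dc_a, updates_b, dc_b):
    # Per-field convergence over only the fields each writer modified;
    # concurrent writes to the same field are settled by DC identifier.
    merged = {}
    for field in updates_a.keys() | updates_b.keys():
        if field in updates_a and field in updates_b:
            merged[field] = updates_a[field] if dc_a < dc_b else updates_b[field]
        elif field in updates_a:
            merged[field] = updates_a[field]
        else:
            merged[field] = updates_b[field]
    return merged

bob = {"street": "2 Elm St, Spokane"}                    # DC1: stays in WA
alice = {"street": "9 Bay Ave, Oakland",
         "state": "CA", "in_washington": False}          # DC2: moves to CA
merged = merge_docs(bob, "DC1", alice, "DC2")
# merged pairs Bob's Washington street with Alice's California state and
# flag: a document neither writer intended, and internally inconsistent.
```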
In RICE, these groups of updates to a document can be performed atomically. Atomic groups of updates ensure that convergence is performed per document where all of the updates for a given document are ordered together when performing conflict resolution – thereby helping the developer to ensure invariants. However, these atomic updates are only good for ensuring one-way invariants: where a given value depends on another value in the document; they are insufficient for preserving two-way invariants where the values of two fields of a document are interdependent and conditional on one another. For that, we refer you to the next post in our series on transactions.
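One way to picture atomic update groups (a simplified sketch, not C8DB’s mechanism): instead of arbitrating field by field, the whole group of updates from one writer wins or loses together, so the surviving document always reflects a single writer’s consistent view.

```python
def merge_update_groups(group_a, dc_a, group_b, dc_b):
    # Arbitrate once per document: the winning group's fields are applied
    # together, never interleaved with the losing group's fields.
    return group_a if dc_a > dc_b else group_b

alice = {"street": "9 Bay Ave, Oakland",
         "state": "CA", "in_washington": False}  # written atomically at DC2
bob = {"street": "2 Elm St, Spokane",
       "state": "WA", "in_washington": True}     # written atomically at DC1
winner = merge_update_groups(alice, "DC2", bob, "DC1")
# All fields come from one writer, so the one-way invariant (the flag
# depends on the state) holds in the merged document.
```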