This is one post in a series about programming models and languages for distributed computing that I’m writing as part of my history of distributed programming techniques.
Abstraction Mechanisms in CLU, Liskov, Barbara and Snyder, Alan and Atkinson, Russell and Schaffert, Craig, CACM 1977 (Liskov et al. 1977).
Guardians and Actions: Linguistic Support for Robust, Distributed Programs, Liskov, Barbara and Scheifler, Robert, TOPLAS 1982 (Liskov and Scheifler 1983).
Orphan Detection in the Argus System, Walker, Edward Franklin, DTIC 1984 (Walker 1984).
Implementation of Argus, Liskov, Barbara and Curtis, Dorothy and Johnson, Paul and Scheifer, Robert, SIGOPS 1987 (Liskov et al. 1987).
Distributed Programming in Argus, Liskov, Barbara CACM 1988 (Liskov 1988).
“However, regardless of advances in hardware, we believe atomic actions are necessary and are a natural model for a large class of applications. If the language does not provide actions, the user will be compelled to implement them, perhaps unwittingly reimplementing with each new application, and may implement them incorrectly.”
The focus of these papers is the ARGUS system (and it’s roots with the programming language CLU), developed in the Laboratory for Computer Science at the Massachusetts Institute of Technology. ARGUS was designed to solve the problem of programming language support for the construction and maintenance of distributed programs constructed from modules executing at, and communicating from, geographically distinct nodes. ARGUS was designed with the following goals in mind:
Service: Programs should have localized failures, and be geographically distributed with replicated data for both fault-tolerance and availability.
Reconfiguration: Software should be able to be reconfigured while the system is running: this allows for the addition of additional capacity to increase processing power, decrease response times, or increase the availability of data.
Autonomy: Nodes in the system may be owned by individuals or organizations that need to control what data is replicated at the node for political or sociological reasons.
Distribution: Programs should be able to control explicit placement of data to ensure responsiveness and cost-effectiveness of hardware in the system.
Concurrency: Distribution should be able to exploit the available concurrency in the system.
Consistency: Consistency must be maintained, specifically for invariant preservation: for instance, conservation of funds during a transfer between two accounts.
Of all of the aforemtioned concerns, the authors posit that the most difficult of these to provide is consistency: consistency becomes much more difficult to preserve in a system where coordination is minimized to avoid interference and mask failures of the network. Therefore, the authors present a technique for integrating atomicity as fundamental concept in a programming language.
Since data resiliency only guarantees consistency under a quiescent environment, it is necessary to make some operations in the system atomic. When the authors state atomic, they mean two things:
Indivisibility: the execution of an activity in the system never appears to overlap with another activity in the system.
Recoverability: the overall effect of an activity is all or nothing: in the event of a failure, the activity must be completed after recovery or all objects must be restored to their initial state.
ARGUS introduces actions: atomic activities that need to either be committed or aborted and provide both the indivisibility and recoverability properties. If an action happens to abort, it is as if the the effects had never happened; if an action happens to commit, all modified objects in the system take on their new state. However, since ARGUS is a concurrent system, shared objects that will be manipulated by concurrent atomic actions must be done via synchronization, with read/and write locks.
To achieve this, ARGUS introduces a type of atomic object, or as they call it, atomic abstract data types: sets of objects and primitive operations for interacting with those objects. These objects are similar to normal data types, but are extended with operations that ensure indivisibility and recoverability. Access to objects within an atomic abstract data type are done through locking: read/write locks with versioning is used to facilitate concurrency access in the system. Write operations operate against a copy of the current version, and when the action completes, the new version of the object is written, replacing the old.
The authors argue for indivisibility as a desired property, even if it reduces the amount of potential concurrency in the system:
“It has been argued that indivisibility is too strong a property for certain applications because it limits the amount of potential concurrency. We believe that indivisibility is the desired property for most application, if it is required only at the appropriate levels of abstraction. ARGUS provides a mechanism for user-defined atomic data types.”
Nested actions, or subactions, can be used to further divide actions and introduce concurrency within an action: each action can contain any number of subactions that can be either executed sequentially or concurrently. Subactions can commit or abort independently and this behavior has no impact on the parent action. However, an abort of a parent will force the abort of all subactions.
Nested actions provide support for the composition of actions into a larger action where some of the nested actions can run in parallel or might fail and have to be compensated for. For instance, if one subaction has to contact a remote guardian that may be unavailable, that subaction can abort and the operation retried against another remote guardian: the authors refer to this as “several ways to accomplish a task.”
Because nested actions may fail, interact with several guardians, and have “checkpointing” behavior as they either abort or succeed, each nested actions ancestor needs to track metadata during the execution:
Each subaction will inherit locks from its ancestor: read locks are directly inherited, but write locks can only be inherited if all read locks can also be inherited.
Each subaction can abort or succeed: if the action aborts, the ancestor will restore the state, or “checkpoint”, of the execution to before the operation was executed.
Each version created by a subaction needs to be propagated, to ensure that the effects of a previous write to a successful nested action are available to a subsequent subaction.
Each top-level action will track what’s referred to as a plist, which is a list of all of the guardians contacted during the execution of the action to be used in the final two-phase commit protocol to commit the action and write to stable storage.
The authors argue for Remote Procedure Call and state that nested actions can be used for masking the communication failures inherent in the paradigm:
“In fact, we believe the form of communication that is needed is remote procedure call, with at-most-once semantics, namely, that (effectively) either the message is delivered and acts on exactly once, with exactly one reply received, or the message is never delivered and the sender is so informed.”
The authors believe that low-level issues from the system should be shielded from the user, such as packet retransmission, but do believe that the system should make a reasonable attempt of delivery for messages. However, the users believe that the notion that long delays are possible and that ultimate failure is possible, should be made present to the user. The authors propose that the subaction should be used for this: individual subactions can abort without causing the entire parent action to abort: in the case where these subactions do abort, the user should be able to take action, such as try an alternative replica to service the request.
In this model, some top-level actions can not be aborted. Consider the case of an airline reservation system: the clerk can initiate operations for making a reservation and subactions can take action and try multiple replicas if, for instance, the primary can not be contacted. However, an external event performed by a top-level action, such as printing a check, can not be undone and needs to have a compensating action for dealing with this behavior. To alleviate this problem, the authors suggest breaking these into two separate top-level actions, where they are sequenced to ensure the completion of one before the execution of the subsequent operation.
Finally, the authors acknowledge that even in this model timeouts are necessary to resolve issues that might arise from contention or deadlocks between atomic resources.
We won’t provide the full semantics of how the ARGUS language works, but provide a brief write up that should give you the general idea of how the system works.
In ARGUS, distributed programs are composed of a group of guardians. Guardians encapsulate and control access to one or more resources and makes them available through operations called handlers, which are called by other guardians. Guardians contain both data and processes: process execute handlers and are spawned for each call to a handler by another guardian. Processes have access to objects that make up the state of the guardian and this is how access to objects is controlled.
Guardians contain both stable and volatile objects: volatile objects are lost when nodes running a guardian fail. When a guardian fails, the language support system is responsible for recreating the guardian with the objects that were persisted to stable storage. Stable state is versioned when modifications are performed, and volatile objects live on the heap. Guardians are created dynamically and the node where the guardian runs is specified by the programmer: guardians represent a logical node of the system where communication is performed by handlers, an abstraction of state and the physical network.
Processes in guardians execute concurrently and the language controls access to atomic objects using the locking mechanisms described above: in addition, a coroutining facility allows subactions to be further divided into concurrent executions, yielding more concurrency from the system.
ARGUS provides static type checking at compile time and loading of new guardians and handlers while the system is running: to ensure this happens safely, types must be known by the system before loading a handler that uses these types. Guardians and handlers are first class and can be used as both arguments and return values from handler invocations.
As the authors highlight, the major drawbacks of the system are in security and scheduling. ARGUS provides no mechanism for securing guardians from particular guardians or preventing nodes from launching certain types of guardians. ARGUS also has no mechanism for priority servicing of calls, backpressure, or load-shedding, leading the system susceptible to scheduling problems when overloaded.
Finally, the authors acknowledge that concurrent programming is hard. While ARGUS goes to great efforts to simplify the problem of maintaining consistency in a distributed application, as highlighted by the banking example, the ARGUS system made it all to easy to introduce deadlocks by having concurrent processes in a guardian, or concurrent actions, take out read/write locks in the wrong order.
ARGUS was arguably the first system to provide a distributed programming paradigm where consistency was integrated in the language as a first class concept. ARGUS built on top of CLU, and provided a series of abstractions directly assisting programmers in writing code that maintained consistency, as the authors felt this was the main challenge in distributed programming.
ARGUS served as the platform where other ideas would blossom: both Remote Procedure Call was used as the primary abstraction for making remote calls, where subactions were used to mask omissions. The optimizations that Promises provided to pipeline RPCs and ensure ordering of requests and responses also were originally developed in the ARGUS system.
The idea of compensating transactions in ARGUS for top-level actions that can not be “recovered” also appeared later in the SAGA design (Garcia-Molina and Salem 1987).
Finally, languages like Bloom (Alvaro et al. 2011), while not providing abstractions for consistency, take an alternative approach by providing analysis tools for determining where consistency could be violated and cause program anomalies. The authors propose to plug in a distributed consensus mechanism, such as Zookeeper, to ensure consistency while performing these changes.
Alvaro, Peter, Neil Conway, Joseph M Hellerstein, and William R Marczak. 2011. “Consistency Analysis in Bloom: A Calm and Collected Approach.” In CIDR, 249–60. Citeseer.
Garcia-Molina, Hector, and Kenneth Salem. 1987. Sagas. Vol. 16. 3. ACM.
Liskov, Barbara. 1988. “Distributed Programming in Argus.” Communications of the ACM 31 (3). ACM: 300–312.
Liskov, Barbara, and Robert Scheifler. 1983. “Guardians and Actions: Linguistic Support for Robust, Distributed Programs.” ACM Transactions on Programming Languages and Systems (TOPLAS) 5 (3). ACM: 381–404.
Liskov, Barbara, Dorothy Curtis, Paul Johnson, and Robert Scheifer. 1987. “Implementation of Argus.” ACM SIGOPS Operating Systems Review 21 (5). ACM: 111–22.
Liskov, Barbara, Alan Snyder, Russell Atkinson, and Craig Schaffert. 1977. “Abstraction Mechanisms in Clu.” Communications of the ACM 20 (8). ACM: 564–76.
Walker, Edward Franklin. 1984. “Orphan Detection in the Argus System.” DTIC Document.