I recently spent some time writing an Erlang/OTP application where I
needed to broadcast messages to a series of blocking
webmachine resources, which would return multipart
responses to the user derived from the messages each process received.
I implemented this prototype using the built-in
distributed named process group registry provided by OTP, as the
webmachine resources would be split up amongst a series of nodes.
However, unsure of how pg2 actually performs during node
unavailability and cluster topology changes, and wary of reports of
problems due to race conditions and partitions, I set out to figure
out exactly what the failure semantics of pg2 are.

pg2 is a distributed named process group registry, which has replaced
the experimental process group registry pg. One of the major
differences from pg is that rather than being able to message a group
directly, pg2 will only return a process listing, and messaging is
left up to the user. This drastically simplifies the domain, as the
library no longer needs to worry about partial writes to the group, or
failed message sends due to processes being located across a partition,
before the Erlang distribution protocol determines the node is
unreachable and times out the connection.
First, we can register a process group on one node, and see the results on the clustered node.
We’ll run these operations in an Erlang shell,
with two clustered nodes.
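A session along these lines shows the group becoming visible on both
nodes. The node names and pids here are illustrative, not taken from a
real run:

```erlang
%% On the first node, create a named group:
(node1@host)1> pg2:create(group).
ok

%% On the second, clustered node, the group is already visible:
(node2@host)1> pg2:which_groups().
[group]
```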
Getting the members then returns a membership list containing no
processes, rather than throwing an exception.
However, you can see interesting behaviour if you attempt to query the
registry prior to starting the
pg2 application, as it’s started on demand.
pg2 also, maybe counterintuitively, allows processes to join the group
multiple times, which is something to be aware of when leveraging it
as a publish/subscribe mechanism.
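A minimal sketch of the duplicate-join behaviour, with an illustrative
pid:

```erlang
(node1@host)1> pg2:create(group).
ok
(node1@host)2> Self = self().
<0.46.0>
(node1@host)3> pg2:join(group, Self).
ok
(node1@host)4> pg2:join(group, Self).
ok
%% The same pid now appears twice in the membership list, so a naive
%% broadcast over get_members/1 would deliver the message twice:
(node1@host)5> pg2:get_members(group).
[<0.46.0>,<0.46.0>]
```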
How does pg2 handle a node addition?
First, register a group and add a process to it.
Then, start a third node and register two groups, one with the same name, and one with a different name.
Now, join the node to the cluster.
Everything resolves nicely.
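The whole sequence might look like the following sketch; node names,
pids, and the group names are assumptions for illustration:

```erlang
%% On node1: register a group and add a process to it.
(node1@host)1> pg2:create(group).
ok
(node1@host)2> pg2:join(group, self()).
ok

%% On node3, before clustering: the same group name, plus a new one.
(node3@host)1> pg2:create(group).
ok
(node3@host)2> pg2:create(another_group).
ok

%% Join node3 to the cluster.
(node3@host)3> net_adm:ping('node1@host').
pong

%% Both nodes now agree on the merged state.
(node3@host)4> lists:sort(pg2:which_groups()).
[another_group,group]
```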
How does pg2 handle a partition?
At this point, we suspend one of the nodes, and wait the
net_ticktime interval for the node to be detected as down.
And, once the node is restored the process returns.
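From the surviving node, this might look like the following sketch
(the remote pid is illustrative):

```erlang
%% Before the partition, the remote process is visible:
(node1@host)1> pg2:get_members(group).
[<9007.46.0>]

%% After the suspended node is reported down, roughly net_ticktime
%% seconds later, its processes are dropped from the group:
(node1@host)2> pg2:get_members(group).
[]

%% Once the node is restored and reconnects, its state is transferred
%% back and the process reappears:
(node1@host)3> pg2:get_members(group).
[<9007.46.0>]
```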
When a remote
pid is registered, and its state transferred to another node, the
pg2 registration system sets up an Erlang monitor to that
node. When that node happens to become unavailable, processes located on
remote nodes are removed from the process group. When the node returns,
it transfers its state, repopulating the process list with the processes
which are now available.
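The node-monitoring mechanism can be sketched with
erlang:monitor_node/2; this is not pg2’s actual code (which varies
across OTP releases), just an illustration of the primitive involved:

```erlang
%% Sketch: react to a node becoming unavailable, the way pg2 does
%% for nodes hosting registered remote pids.
-module(node_watcher).
-export([watch/1]).

watch(Node) ->
    %% Ask the runtime to deliver {nodedown, Node} when the
    %% connection to Node is lost or times out.
    true = erlang:monitor_node(Node, true),
    receive
        {nodedown, Node} ->
            %% This is the point where pg2 prunes that node's
            %% processes from every group it maintains.
            io:format("~p is down~n", [Node])
    end.
```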
A delete request will block until the unavailable node times
out. At that point the group will be considered deleted.
However, when the partition heals, the second node will re-create the group on the primary.
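A sketch of that resurrection, assuming no other groups are
registered:

```erlang
%% Deleting while a member node is unreachable blocks until the
%% distribution protocol times that node out, then succeeds:
(node1@host)1> pg2:delete(group).
ok
(node1@host)2> pg2:which_groups().
[]

%% When the partition heals, the other node transfers its state back
%% and the group reappears:
(node1@host)3> pg2:which_groups().
[group]
```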
We also observe the same timeout behaviour when performing
create operations; however, once the disconnected node is reconnected,
the correct state is propagated out.
When performing a
join on either side of a partition, we
can also observe that when the partition heals, the updates are merged.
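A sketch of joins merging across a heal, with illustrative node names:

```erlang
%% During the partition, join a local process on each side:
(node1@host)1> pg2:join(group, self()).
ok
(node2@host)1> pg2:join(group, self()).
ok

%% After the partition heals, both members are present everywhere:
(node1@host)2> length(pg2:get_members(group)).
2
```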
What about create with a three node cluster?
Here’s the state before the partition is healed.
At this point node 2 is completely disconnected from the rest of the nodes.
Here’s the state after the partition is healed.
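The before and after states might look like this sketch, assuming the
group name and node names shown and no other registered groups:

```erlang
%% During the partition, create a group on the isolated node:
(node2@host)1> pg2:create(partitioned_group).
ok

%% The rest of the cluster doesn't see it yet:
(node1@host)1> pg2:which_groups().
[]

%% After the partition heals, the group is propagated everywhere:
(node1@host)2> pg2:which_groups().
[partitioned_group]
```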
What about leave with a three node cluster?
Then we leave during the partition, which can only happen on the node running the process.
And, we see that the updated state is propagated.
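As a sketch, again with illustrative node names:

```erlang
%% On the node running the process, which is the only place the
%% leave can happen, remove it from the group during the partition:
(node2@host)1> pg2:leave(group, self()).
ok

%% Once the partition heals, the departure is propagated:
(node1@host)1> pg2:get_members(group).
[]
```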
pg2 has some interesting semantics regarding processes being able to
register themselves in a process group multiple times, and groups
resurrecting themselves during the healing of a partition. That said,
the failure conditions during cluster membership changes and
partitions appear to be straightforward, and I’m hopeful that I didn’t
miss any obvious failure scenarios. Feedback welcome and greatly
appreciated!
Updated 2013-06-04: Added details on how node addition is handled in
pg2.