I recently spent some time writing an Erlang/OTP application where I needed to broadcast messages to a series of blocking webmachine resources, which would return multipart responses to the user derived from the messages each process received. I implemented this prototype using the built-in pg2 distributed named process group registry provided by OTP, as the webmachine resources would be split up amongst a series of nodes.
However, since I was unsure of how pg2 actually performs during node unavailability and cluster topology changes, and had seen reports of problems due to race conditions and partitions, I set out to figure out exactly what the failure semantics of pg2 actually are.
What is pg2?
pg2 is a distributed named process group registry, which has replaced the experimental process group registry, pg. One of the major departures from pg is that rather than being able to message a group of processes, pg2 will only return a process listing, and messaging is left up to the user. This drastically simplifies the domain, as the library no longer needs to worry about partial writes to the group, or failed message sends due to processes being located across a partition, before the Erlang distribution protocol determines the node is unreachable and times out the connection.
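The API surface is small; here's a minimal sketch of the pattern described above, with the group and message names being illustrative, not part of the original application:

```erlang
%% Minimal sketch of the pg2 API: pg2 tracks membership only,
%% so broadcasting a message to the group is done by the caller.
ok = pg2:create(notifications),
ok = pg2:join(notifications, self()),
Members = pg2:get_members(notifications),
[Pid ! {broadcast, <<"hello">>} || Pid <- Members].
```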
How does pg2 work?
First, we can register a process group on one node, and see the results on the clustered node.
We’ll run these operations in a riak_core application, with two clustered nodes, riak_pg1 and riak_pg2.
Getting the members then returns the membership list containing no processes, rather than an {error, no_such_group} response.
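The original shell transcripts aren't reproduced here, but the session might have looked something like this (node names follow the setup above; output is illustrative):

```erlang
%% On the first node, create the group:
([email protected])1> pg2:create(group).
ok

%% On the second, clustered node, the group is already visible,
%% and getting the members returns an empty list, not an error:
([email protected])1> pg2:which_groups().
[group]
([email protected])2> pg2:get_members(group).
[]
```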
However, you can see interesting behaviour if you attempt to use pg2 prior to starting the pg2 server, as it’s started on demand.
pg2
also, maybe counterintuitively, allows processes to join the group
multiple times, which is something to be aware of when leveraging pg2
as a publish/subscribe mechanism.
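For example, a process that joins twice appears twice in the membership list, so a naive broadcast over get_members would deliver each message to it twice (a sketch; the group name is illustrative):

```erlang
%% Duplicate joins are not collapsed by pg2.
ok = pg2:create(subscribers),
ok = pg2:join(subscribers, self()),
ok = pg2:join(subscribers, self()),
%% The membership list now contains self() twice.
2 = length(pg2:get_members(subscribers)).
```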
How does pg2 handle a node addition?
First, register a group and add a process to it.
Then, start a third node and register two groups, one with the same name, and one with a different name.
Now, join the node to the cluster.
Everything resolves nicely.
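Roughly, the session on the third node might look like the following before and after joining (node names and output are illustrative):

```erlang
%% On the new, not-yet-clustered node:
([email protected])1> pg2:create(group).
ok
([email protected])2> pg2:create(other_group).
ok

%% After joining the cluster, both sides' groups are merged:
([email protected])3> pg2:which_groups().
[group,other_group]
```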
How does pg2 handle a partition?
At this point, we suspend the [email protected] node, and wait out the net_ticktime interval for the node to be detected as down.
And, once the node is restored, the process returns.
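From the surviving node, the effect might look like this (pids and output are illustrative):

```erlang
%% While [email protected] is suspended and detected as down,
%% its pids are pruned from the group:
([email protected])1> pg2:get_members(group).
[]

%% Once the node returns and state is re-exchanged, the
%% remote process reappears:
([email protected])2> pg2:get_members(group).
[<6793.90.0>]
```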
When a remote pid is registered and its state is transferred to another node, the pg2 registration system sets up an Erlang monitor on that node. When that node becomes unavailable, processes located on remote nodes are removed from the process group. When the node returns, it transfers its state, repopulating the process list with the processes which are now available.
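That mechanism can be sketched as ordinary node monitoring: prune on nodedown, re-merge on nodeup. Here prune_members/1 and merge_state/1 are hypothetical helper names, not pg2 internals:

```erlang
%% Sketch of the node-monitoring idea described above.
watch() ->
    %% Subscribe this process to nodeup/nodedown messages.
    ok = net_kernel:monitor_nodes(true),
    loop().

loop() ->
    receive
        {nodedown, Node} ->
            %% Remove group members living on the downed node.
            prune_members(Node),
            loop();
        {nodeup, Node} ->
            %% On reconnect, the nodes exchange and merge state.
            merge_state(Node),
            loop()
    end.
```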
First, the delete
request will block until the unavailable node times
out. At that point the group will be considered deleted.
However, when the partition heals, the second node will re-create the group on the primary.
We also observe the same timeout behaviour when performing leave and create operations; however, once the disconnected node is reconnected, the correct state is propagated out.
When performing a create
and join
on either side of a partition, we
can also observe that when the partition heals, the updates are
performed correctly.
What about create with a three node cluster?
Here’s the state before the partition is healed.
At this point node 2 is completely disconnected from the rest of the nodes.
Here’s the state after the partition is healed.
What about leave with a three node cluster?
Then, we leave during the partition, which can only happen on the node running the process.
And, we see that the updated state is propagated.
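After healing, the nodes converge on the same view; the pid that performed the leave is no longer listed anywhere (output is illustrative):

```erlang
%% Queried on any of the three nodes after the partition heals:
([email protected])1> pg2:get_members(group).
[]
([email protected])1> pg2:get_members(group).
[]
```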
While pg2 has some interesting semantics regarding processes being able to register themselves in the process group multiple times, and groups resurrecting themselves during the healing of a partition, the failure conditions during cluster membership changes and partitions appear to be straightforward and deterministic.
I’m hopeful that I didn’t miss any obvious failure scenarios. Feedback welcome and greatly appreciated!
Updated 2013-06-04: Added details on how node addition is handled in
pg2
.
View this story on Hacker News.