Erlang gproc Failure Semantics

05 Jun 2013

In my previous post, we discussed the failure semantics of pg2 during partitions and when dealing with dynamic cluster management. In this post, we’re going to look at a common alternative to pg2, the extended process registry, gproc.

So, what is gproc?

gproc, described by Ulf Wiger in his 2007 paper, is an extended process registry for Erlang which provides a number of benefits over the default process registry. It also provides a global mode, implemented on top of gen_leader, a leader election behaviour based roughly on a 2005 paper and an original implementation by Thomas Arts and Hans Svensson.

Briefly, some of the notable improvements include:

  • Any term can be a process alias
  • Processes can have multiple aliases (non-unique aliases are also supported)
  • Counters, aggregated counters
  • Global distribution of the process registry
  • Thoroughly tested locally with Erlang QuickCheck

We’re specifically going to look at emulating a pg2-like interface with gproc by tagging groups of processes with a name and querying gproc for the list of processes in a group.

Let’s explore the API using just the local distribution model of gproc.

First, register the current process and ask for the registered processes under that name.

(riak_pg1@127.0.0.1)1> gproc:reg({p, l, group}).
true
(riak_pg1@127.0.0.1)2> gproc:lookup_pids({p, l, group}).
[<0.501.0>]

To explain the registration key: group is a term alias for the process, p represents a non-unique property, so we can tag multiple processes, and l means register locally.
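
Putting that together, a minimal pg2-style wrapper over the local registry might look roughly like the sketch below (the module and function names are hypothetical, and each function assumes it is called by the process joining or leaving the group):

-module(gproc_groups).
-export([join/1, leave/1, members/1]).

%% Register the calling process under the group name.
join(Group) ->
    gproc:reg({p, l, {?MODULE, Group}}).

%% Remove the calling process from the group.
leave(Group) ->
    gproc:unreg({p, l, {?MODULE, Group}}).

%% Return every pid currently registered under the group name.
members(Group) ->
    gproc:lookup_pids({p, l, {?MODULE, Group}}).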

Some application changes to support gproc.

We start off by adding a dependency on gproc to the test application we used last time to exercise pg2. We also need to add gen_leader as a dependency to support global distribution of the registry.

We’ll use the version of gen_leader currently recommended by gproc, which is known to have multiple problems during netsplits.

{gproc, ".*", 
 {git, "git://github.com/uwiger/gproc.git", "master"}},
{gen_leader, ".*",
 {git, "https://github.com/garret-smith/gen_leader_revival.git", "master"}}

We’ll also modify the vm.args file to enable global distribution.

-gproc gproc_dist all

Done.
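
As an aside, I believe the same setting can also be expressed as an application environment entry in a config file loaded with -config, along these lines:

%% sys.config: equivalent application environment setting.
[
 {gproc, [{gproc_dist, all}]}
].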

First things first…

It’s important to note that nodes must be connected before the gproc application is started (gproc is ultimately responsible for starting gen_leader with a candidate list).
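
In practice that might look something like the following, run on each node at startup (the peer list here is hypothetical and would normally come from configuration):

%% Hypothetical peer list; in a real system this would come from
%% configuration (for example, the riak_core ring).
Peers = ['riak_pg2@127.0.0.1', 'riak_pg3@127.0.0.1'],

%% Connect to every peer first; ping returns pong on success.
[pong = net_adm:ping(Peer) || Peer <- Peers],

%% Only then start gproc, so gen_leader sees the full candidate list.
ok = application:start(gproc).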

Let’s see this in action. Starting with our same riak_core cluster of three nodes, we’ll start the first two nodes and inspect the gproc state.

Again, to explain the registration key: group is a term alias for the process, p represents a non-unique property, so we can tag multiple processes, and g means register globally.

(riak_pg1@127.0.0.1)1> gproc:reg({p, g, group}).
true
(riak_pg1@127.0.0.1)2> gproc_dist:get_leader().
'riak_pg1@127.0.0.1'
(riak_pg1@127.0.0.1)4> nodes().
['riak_pg2@127.0.0.1']
(riak_pg1@127.0.0.1)5> gproc:lookup_pids({p, g, group}).
[<0.489.0>]
(riak_pg2@127.0.0.1)3> gproc:reg({p, g, group}).
true
(riak_pg2@127.0.0.1)4> nodes().
['riak_pg1@127.0.0.1']
(riak_pg2@127.0.0.1)5> gproc_dist:get_leader().
'riak_pg2@127.0.0.1'
(riak_pg2@127.0.0.1)6> gproc:lookup_pids({p, g, group}).
[<0.996.0>]

As the gproc application was started before the nodes were connected, each node is its own leader, with its own state, which is not replicated.

Now, let’s restart the gproc application once we know for sure the nodes are connected.

(riak_pg1@127.0.0.1)7> nodes().
['riak_pg2@127.0.0.1']
(riak_pg1@127.0.0.1)8> application:start(gproc).
ok
22:59:20.357 [info] Application gproc started on node 'riak_pg1@127.0.0.1'
(riak_pg1@127.0.0.1)9> gproc:reg({p, g, group}).
true
(riak_pg1@127.0.0.1)10> self().
<0.489.0>
(riak_pg1@127.0.0.1)11>
(riak_pg2@127.0.0.1)8> application:start(gproc).
ok
22:59:33.919 [info] Application gproc started on node 'riak_pg2@127.0.0.1'
(riak_pg2@127.0.0.1)9> gproc:lookup_pids({p, g, group}).
[<12417.489.0>]
(riak_pg2@127.0.0.1)10>

So, what happens when a node becomes unreachable?

First, let’s make the first node, which has the registered process, unreachable.

(riak_pg2@127.0.0.1)10> gproc:lookup_pids({p, g, group}).
[<12417.489.0>]
(riak_pg2@127.0.0.1)11> 
23:08:28.289 [error] ** Node 'riak_pg1@127.0.0.1' not responding **
** Removing (timedout) connection **
(riak_pg2@127.0.0.1)11> gproc:lookup_pids({p, g, group}).
[<12417.489.0>]
(riak_pg2@127.0.0.1)12> gproc:lookup_pids({p, g, group}).
[]

The process is visible for a short period, but is removed shortly after the net tick time period elapses and the connection is timed out.
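
That window is governed by the kernel's net_ticktime setting, which defaults to 60 seconds; it can be inspected from any shell, or tuned in vm.args, for example:

%% Inspect the current tick time from any node's shell.
net_kernel:get_net_ticktime().

%% Or lower it cluster-wide via vm.args (example value only):
%% -kernel net_ticktime 10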

When the partition heals, we see the process return.

(riak_pg2@127.0.0.1)26> gproc:lookup_pids({p, g, group}).
[<12417.489.0>]

What happens when a registration is added during a partition?

(riak_pg1@127.0.0.1)20> gproc:lookup_pids({p, g, group}).
[<0.10439.0>]
23:23:19.325 [error] ** Node 'riak_pg2@127.0.0.1' not responding **
** Removing (timedout) connection **
(riak_pg2@127.0.0.1)38> gproc:lookup_pids({p, g, group}).
[]
(riak_pg2@127.0.0.1)39> gproc:reg({p, g, group}).
true
(riak_pg2@127.0.0.1)40> gproc:lookup_pids({p, g, group}).
[<0.996.0>]

And, when we heal that partition.

(riak_pg2@127.0.0.1)41> gproc:lookup_pids({p, g, group}).
[<0.996.0>,<12417.10439.0>]
(riak_pg1@127.0.0.1)21> gproc:lookup_pids({p, g, group}).
[<12417.996.0>,<0.10439.0>]

So, what happens when we add a node with an unresponsive leader?

First, identify the leader.

(riak_pg1@127.0.0.1)33> gproc:lookup_pids({p, g, group}).
[<0.13370.0>,<12417.14190.0>]
(riak_pg1@127.0.0.1)34> gproc_dist:get_leader().
'riak_pg1@127.0.0.1'

Now, take the leader down.

(riak_pg2@127.0.0.1)49> gproc:lookup_pids({p, g, group}).
[<12417.13370.0>,<0.14190.0>]
(riak_pg2@127.0.0.1)50> gproc_dist:get_leader().
'riak_pg1@127.0.0.1'
(riak_pg2@127.0.0.1)51> 
23:52:04.443 [error] ** Node 'riak_pg1@127.0.0.1' not responding **
** Removing (timedout) connection **
(riak_pg2@127.0.0.1)51> gproc_dist:get_leader().
'riak_pg2@127.0.0.1'
(riak_pg2@127.0.0.1)52> gproc:lookup_pids({p, g, group}).
[<0.14190.0>]
(riak_pg2@127.0.0.1)53>

Then, join the third node.

(riak_pg3@127.0.0.1)3> gproc_dist:get_leader().
** exception exit:
{timeout,{gen_leader,local_call,[gproc_dist,get_leader]}}
in function  gen_leader:call/2 (src/gen_leader.erl, line 326)
(riak_pg3@127.0.0.1)4> gproc:lookup_pids({p, g, group}).
[]

Finally, restore the original leader.

(riak_pg2@127.0.0.1)53> gproc:lookup_pids({p, g, group}).
[<12417.13370.0>,<0.14190.0>]
(riak_pg2@127.0.0.1)54> gproc_dist:get_leader().
'riak_pg1@127.0.0.1'
(riak_pg2@127.0.0.1)55>

Now, let’s check in on the third node.

(riak_pg3@127.0.0.1)6> nodes().
['riak_pg1@127.0.0.1','riak_pg2@127.0.0.1']
(riak_pg3@127.0.0.1)7> gproc_dist:get_leader().
** exception exit:
{timeout,{gen_leader,local_call,[gproc_dist,get_leader]}}
in function  gen_leader:call/2 (src/gen_leader.erl, line 326)
(riak_pg3@127.0.0.1)8>

We see that it’s fully connected; however, the get_leader call continues to fail with timeouts until we restart gproc on the first two nodes. In fact, if we restart gproc on all three nodes too quickly, simulating small partitions during the restart period, we run into the same situation again, with nodes unable to determine who the leader is due to deadlock. Some of these issues are also documented in the future work section of Ulf’s 2007 paper.
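
For clarity, restarting gproc here simply means cycling the application on each affected node once the nodes are reconnected, roughly:

%% Run on each affected node after confirming nodes() looks right.
ok = application:stop(gproc),
ok = application:start(gproc).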

Things get worse…

Let’s register a unique name on the primary node instead of the non-unique property types.

(riak_pg1@127.0.0.1)5> gproc:reg({n, g, riak_pg1}).
true
(riak_pg1@127.0.0.1)6> gproc:lookup_pids({n, g, riak_pg1}).
[<0.513.0>]
(riak_pg1@127.0.0.1)7> gproc:lookup_pids({n, g, riak_pg1}).
[<0.513.0>]
(riak_pg2@127.0.0.1)4> gproc:lookup_pids({n, g, riak_pg1}).
[<12457.513.0>]
(riak_pg2@127.0.0.1)5>

Now, let’s partition the network.

(riak_pg1@127.0.0.1)8> 
14:46:19.963 [error] ** Node 'riak_pg2@127.0.0.1' not responding **
** Removing (timedout) connection **
14:46:19.963 [error] ** Node 'riak_pg3@127.0.0.1' not responding **
** Removing (timedout) connection **

And, as soon as we heal the partition:

(riak_pg1@127.0.0.1)8> gproc:lookup_pids({n, g, riak_pg1}).
[<0.513.0>]
(riak_pg1@127.0.0.1)9> gproc:lookup_pids({n, g, riak_pg1}).
[]

What? What happened to the registered name? Did the process die?

(riak_pg1@127.0.0.1)10> self().
<0.513.0>
(riak_pg1@127.0.0.1)11>

Nope.

(riak_pg1@127.0.0.1)12> gproc:lookup_pids({n, g, riak_pg1}).
[<0.513.0>]

But it eventually returns. This is the behaviour when the leader becomes unavailable and there is a period during which no leader is elected.
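
If an application needs to ride out these leaderless windows, one crude option is to retry the lookup for a bounded number of attempts; a rough sketch, with a hypothetical helper module:

-module(gproc_lookup).
-export([with_retry/2]).

%% Retry a global lookup a few times to ride out short windows in
%% which no leader is elected and names appear to be missing.
with_retry(Key, 0) ->
    gproc:lookup_pids(Key);
with_retry(Key, Retries) when Retries > 0 ->
    case gproc:lookup_pids(Key) of
        [] ->
            timer:sleep(500),
            with_retry(Key, Retries - 1);
        Pids ->
            Pids
    end.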

Permanent data loss is also considered a possibility with some of the data structures, as mentioned in the paper. It’s important to note that I couldn’t trigger this type of behaviour with the non-unique property types.

Counters

I didn’t review counters (single, shared, or aggregated) for data loss, as I’m saving that topic for a future blog post.

Conclusion

While gproc has been tested very thoroughly using tools like Erlang QuickCheck, its reliance on gen_leader is problematic. Despite the numerous forks and reimplementations of gen_leader that attempt to address its data loss and netsplit issues, none has been proven to operate correctly and guarantee termination. Above, using a few small simulated tests across a cluster of at most three nodes, we observed data loss, failure to handle dynamic membership, and timeout situations caused by deadlock and failed leader election.

gproc seems to operate extremely well when used locally, but any use of it in a distributed system, where you have dynamic cluster management and network partitions, appears to be non-deterministic due to its reliance on gen_leader.

Feedback encouraged!

Thanks to OJ Reeves and Heinz N. Gies for providing valuable feedback on this blog post.
