Introduction to Apache Zookeeper, backbone of big-data systems

Published on 09 January 2015, in #data-engineering

Apache Kafka, Mesos, Hadoop/YARN, Neo4J, HBase, Solr. All of these services (and many others) are built on top of Apache ZooKeeper. ZooKeeper is also a part of all Hadoop distributions (e.g. Cloudera CDH) and is used in many companies, including LinkedIn, Twitter, Netflix or Yahoo.

So what is ZooKeeper all about and why is it that popular?

In summary, ZooKeeper helps to build a distributed system #

Following the Unix philosophy of small yet powerful tools, Zookeeper is a service that allows distributed processes to coordinate with each other. Instead of developing such a coordination system from scratch, you can re-use ZooKeeper and benefit from its best-practice proven implementation.

Distributed systems use ZooKeeper for service discovery, cluster monitoring, configuration management, leader election or naming service. Also, ZooKeeper provides foundation for building higher-level distributed functions like locks, barriers, queues, two-phase commits, etc.

Or, as the Apache Zookeeper PMC puts it:

If you are doing distributed locking and you are not using ZooKeeper, you are crazy - Camille Fournier, Apache ZooKeeper PMC

ZooKeeper fundamentals: Z-nodes #

ZooKeeper itself is a distributed highly-available service that manages a shared hierarchical name space of data registers, called z-nodes.

Z-nodes very much resemble a filesystem. Each z-node has a path, e.g. /services/product, and data, given as byte[]. However, unlike directories in a filesystem, parent z-nodes can also carry data.

Here is an example of z-nodes hierarchy:

image is missing, sorry

ZooKeeper provides a simple yet powerful API for z-node management. You can create, get, update and delete a z-node, ask if it exists and get children.

The interesting part, however, lies in the consistency guarantees that ZooKeeper provides:

Sequential consistency - Updates are applied in order they are received by ZooKeeper
Update atomicity - Updates are either successful or failed. No partial results
Single system image - A client sees the same view of the service regardless of the ZooKeeper server it connects to
Reliability - Once an update has been applied, it will persist from that time forward until a client overwrites the update
Timeliness - The clients view of the system is guaranteed to be up-to-date within a certain time bound

I hope that by now you start to feel the power of ZooKeeper. However, this is not all.

Persistent, ephemeral and sequential z-nodes #

Regarding persistence, ZooKeeper offers two types of z-nodes: persistent and ephemeral.

Persistent z-nodes are the default z-nodes, they exist until they are explicitly deleted.

Ephemeral z-nodes, on the other hand, are attached to a client session. If a client dies, the session is terminated and the ephemeral z-done is deleted. How great is that! You can very easily monitor clieant failures: If a client responsible for performing an exclusive operation fails, the corresponding ephemeral z-node is deleted and other clients are notified to take over the operation.

Both persistent and ephemeral z-nodes can be also marked as sequential.

For sequential z-nodes, ZooKeeper automatically appends a monotically increasing counter to the end of a path. This is greatly helpful for synchronization: if more clients want to get a lock on a resource A, they can each create a sequential z-node on path resource/A. The client getting the lowest number is entitled to the lock.

Watches #

ZooKeeper also offers a mechanism called watches. A watch is just a one-time callback, which is triggered every time a z-node changes. Watches are one-shot - if you need to continually monitor a z-node, you need to reset the watch after each event.

Watches are very useful to if you don't want to periodically poll your z-nodes.

Fault-tolerant, highly-available, read-performant #

Zookeeper can run in both stand-alone and distributed fault-tolerant mode. When running in a cluster, a group of ZooKeeper servers is called an ensemble.

Regarding node failures, ZooKeeper tolerates loss of ensemble minority (there is Paxos under the hood). So if the ensemble consists of three or four machines, ZooKeeper will tolerate failure of only one machine in both of these cases. This is why it is recommended to run the ZooKeeper ensemble on an odd number of hosts, since even number of machines doesn't bring any benefit with respect to node failures.

ZooKeeper data is kept in memory and is backed up to a log for reliability. It is optimized for read dominant workloads, handling up to 50k operations per second.

Clients #

ZooKeeper clients can connect to any of the ensemble members and maintain a connection.

ZooKeeper is build on Java, but there is a pretty solid list of client bindings, including Java, Scala, Node.js, Erlang, Haskell, Python, C#, Go and Ruby.

The most popular client is Apache Curator (former LinkedIn project). Definitely checkout the Curator support for high-level recipes, it's a very interesting read.

Use-cases #

Here is a couple of examples how ZooKeeper is used in practice:

Twitter uses ZooKeeper for service discovery. Each application instance registers itself to ZooKeeper using an ephemeral z-node. In this way, ZooKeeper can maintain an up-to-date list of running instances for each type of service. Clients then query ZooKeeper to locate application instances. In general, ZooKeeper might help you to build a micro-service infrastructure or manage a network of replicated (REST) services.

Apache Solr uses ZooKeeper for leader election, centralized configuration and cluster management. ZooKeeper enables application servers to bootstrap configuration from ZooKeeper as soon as they join the system and to keep the configuration up-to-date.

Cluster monitoring can be implemented as members registering to /members/host-{i} and periodically updating the z-node with their status (load, memory, CPU etc). Each z-node update then triggers an alert to z-node listeners using watches.

Apache Kafka, a popular publish/subscribe system with persistent queues, uses ZooKeeper to store client's last consumed offset, to register Kafka brokers and to help load balance requests among live brokers.

Apache Hadoop uses ZooKeeper for automatic failover of Hadoop HDFS Namenode and for high-availability of YARN resources.

Pretty amazing!

Where to go from here #

I was quite eager to experiment with the ensemble, so I made this automated script which creates and provision a virtual 3-node ensemble on your local machine with one command using Vagrant and Ansible. Check it out.

Also, if you want to dive deeper, I'd recommend to:

Read the Getting started guide on the official ZooKeeper website
Checkout the Zookeeper recipies
Bookmark the curated list of Zookeeper presentations
Read a use-case of building a distributed concurrent queue in Zookeeper
Checkout Grid Computing with Fault-Tolerant Actors and ZooKeeper - eBay tech blog

Happy ZooKeeping!

← Previous post: Analyzing Elastic MapReduce data with Python, Pandas and scikit-learn
→ Next post: How to ingest data from Azure DataMarket into Hadoop

This blog is written by Marcel Krcah, an independent consultant for product-oriented software engineering. If you like what you read, sign up for my newsletter