Big Data

Monday, June 2, 2014

Sqoop

A great strength of the Hadoop platform is its ability to work with data in several different forms.

HDFS can reliably store logs and other data from a plethora of sources.

Sqoop:

- A tool that allows users to extract data from a relational database into Hadoop for further processing.

- This processing can be done with MapReduce programs or with higher-level tools such as Hive.

- When the final results of an analytic pipeline are available, Sqoop can export these results back to the database for consumption by other clients.

Getting Sqoop:

- Sqoop is available in a few places.

- The primary home of the project is http://incubator.apache.org/sqoop/

- This repository contains all the Sqoop source code and documentation.

- The repository also contains instructions for compiling the project.

- If you download a release from Apache, it will be placed in a directory such as /home/yourname/sqoop-x.y.z/. We’ll call this directory $SQOOP_HOME. You can run Sqoop by running the executable script $SQOOP_HOME/bin/sqoop.

- If you’ve installed a release from Cloudera, the package will have placed Sqoop’s scripts in standard locations like /usr/bin/sqoop. You can run Sqoop by simply typing sqoop at the command line.

Running Sqoop with no arguments does not do much of interest:

% sqoop

Try sqoop help for usage.

- Sqoop is organized as a set of tools or commands.

- Without selecting a tool, Sqoop does not know what to do. The help tool lists the available ones:

% sqoop help

usage: sqoop COMMAND [ARGS]

Available commands:

  codegen            Generate code to interact with database records
  create-hive-table  Import a table definition into Hive
  eval               Evaluate a SQL statement and display the results
  export             Export an HDFS directory to a database table
  help               List available commands
  import             Import a table from a database to HDFS
  import-all-tables  Import tables from a database to HDFS
  job                Work with saved jobs
  list-databases     List available databases on a server
  list-tables        List available tables in a database
  merge              Merge results of incremental imports
  metastore          Run a standalone Sqoop metastore
  version            Display version information

The help tool can also provide usage instructions for a particular tool if you pass that tool’s name as an argument:

% sqoop help import

usage: sqoop import [GENERIC-ARGS] [TOOL-ARGS]

Common arguments:

  --connect <jdbc-uri>     Specify JDBC connect string
  --driver <class-name>    Manually specify JDBC driver class to use
  --hadoop-home <dir>      Override $HADOOP_HOME
  --help                   Print usage instructions
  -P                       Read password from console
  --password <password>    Set authentication password
  --username <username>    Set authentication username
  --verbose                Print more information while working

...

An alternate way of running a Sqoop tool is to use a tool-specific script named sqoop-toolname, for example sqoop-help or sqoop-import. Running these scripts is identical to running sqoop help or sqoop import.

Sample Import:

- After you install Sqoop, you can use it to import data to Hadoop.

Sqoop imports from databases. The list of databases that it has been tested with includes MySQL, PostgreSQL, Oracle, SQL Server and DB2. For the examples in this chapter we’ll use MySQL, which is easy to use and available for a large number of platforms.

To install and configure MySQL, follow the documentation at http://dev.mysql.com/doc/refman/5.1/en/. Chapter 2 (“Installing and Upgrading MySQL”) in particular should help. Users of Debian-based Linux systems (e.g., Ubuntu) can type sudo apt-get install mysql-client mysql-server. RedHat users can type sudo yum install mysql mysql-server.

Example: Creating a new MySQL database schema

% mysql -u root -p

Enter password:

Welcome to the MySQL monitor. Commands end with ; or \g.

Your MySQL connection id is 349

Server version: 5.1.37-1ubuntu5.4 (Ubuntu)

Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.

mysql> CREATE DATABASE hadoopguide;

Query OK, 1 row affected (0.02 sec)

mysql> GRANT ALL PRIVILEGES ON hadoopguide.* TO '%'@'localhost';

Query OK, 0 rows affected (0.00 sec)

mysql> GRANT ALL PRIVILEGES ON hadoopguide.* TO ''@'localhost';

Query OK, 0 rows affected (0.00 sec)

mysql> quit;

Bye

The password prompt above asks for your root user password. This is likely the same as the password for the root shell login. If you are running Ubuntu or another variant of Linux where root cannot directly log in, then enter the password you picked at MySQL installation time.

In this session, we created a new database schema called hadoopguide, which we’ll use throughout this chapter. We then allowed any local user to view and modify the contents of the hadoopguide schema, and closed our session.

Now let’s log back into the database (not as root, but as yourself this time), and create a table to import into HDFS:

Example: Populating the database

% mysql hadoopguide

Welcome to the MySQL monitor. Commands end with ; or \g.

Your MySQL connection id is 352

Server version: 5.1.37-1ubuntu5.4 (Ubuntu)

Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.

mysql> CREATE TABLE widgets(id INT NOT NULL PRIMARY KEY AUTO_INCREMENT,

-> widget_name VARCHAR(64) NOT NULL,

-> price DECIMAL(10,2),

-> design_date DATE,

-> version INT,

-> design_comment VARCHAR(100));

Query OK, 0 rows affected (0.00 sec)

mysql> INSERT INTO widgets VALUES (NULL, 'sprocket', 0.25, '2010-02-10',

-> 1, 'Connects two gizmos');

Query OK, 1 row affected (0.00 sec)

mysql> INSERT INTO widgets VALUES (NULL, 'gizmo', 4.00, '2009-11-30', 4,

-> NULL);

Query OK, 1 row affected (0.00 sec)

mysql> INSERT INTO widgets VALUES (NULL, 'gadget', 99.99, '1983-08-13',

-> 13, 'Our flagship product');

Query OK, 1 row affected (0.00 sec)

mysql> quit;

In the above listing, we created a new table called widgets. We’ll be using this fictional

product database in further examples in this chapter. The widgets table contains several

fields representing a variety of data types.

Now let’s use Sqoop to import this table into HDFS:

% sqoop import --connect jdbc:mysql://localhost/hadoopguide \

> --table widgets -m 1

10/06/23 14:44:18 INFO tool.CodeGenTool: Beginning code generation

...

10/06/23 14:44:20 INFO mapred.JobClient: Running job: job_201006231439_0002

10/06/23 14:44:21 INFO mapred.JobClient: map 0% reduce 0%

10/06/23 14:44:32 INFO mapred.JobClient: map 100% reduce 0%

10/06/23 14:44:34 INFO mapred.JobClient: Job complete: job_201006231439_0002

...

10/06/23 14:44:34 INFO mapreduce.ImportJobBase: Retrieved 3 records.

Sqoop’s import tool will run a MapReduce job that connects to the MySQL database

and reads the table. By default, this will use four map tasks in parallel to speed up the

import process. Each task will write its imported results to a different file, but all in a

common directory. Since we knew that we had only three rows to import in this example,

we specified that Sqoop should use a single map task (-m 1) so we get a single file in HDFS.

We can inspect this file’s contents like so:

% hadoop fs -cat widgets/part-m-00000

1,sprocket,0.25,2010-02-10,1,Connects two gizmos

2,gizmo,4.00,2009-11-30,4,null

3,gadget,99.99,1983-08-13,13,Our flagship product

Generated code:

In addition to writing the contents of the database table to HDFS, Sqoop has also

provided you with a generated Java source file (widgets.java) written to the current local

directory. (After running the sqoop import command above, you can see this file by

running ls widgets.java.)

The generated class (widgets) is capable of holding a single record retrieved from the

imported table. It can manipulate such a record in MapReduce or store it in a SequenceFile

in HDFS. (SequenceFiles written by Sqoop during the import process will store

each imported row in the “value” element of the SequenceFile’s key-value pair format, using the generated class.)

It is likely that you don’t want to name your generated class widgets since each instance

of the class refers to only a single record. We can use a different Sqoop tool to generate

source code without performing an import; this generated code will still examine the

database table to determine the appropriate data types for each field:

% sqoop codegen --connect jdbc:mysql://localhost/hadoopguide \

> --table widgets --class-name Widget

The codegen tool simply generates code; it does not perform the full import. We specified that we’d like it to generate a class named Widget; this will be written to Widget.java. We also could have specified --class-name and other code-generation arguments during the import process we performed earlier. This tool can be used to regenerate code if you accidentally remove the source file, or to generate code with different settings than were used during the import.

Database Imports: Deeper look

As mentioned earlier, Sqoop imports a table from a database by running a MapReduce

job that extracts rows from the table, and writes the records to HDFS. How does MapReduce

read the rows? This section explains how Sqoop works under the hood.

At a high level, the figure below demonstrates how Sqoop interacts with both the database

source and Hadoop. Like Hadoop itself, Sqoop is written in Java. Java provides an API

called Java Database Connectivity, or JDBC, that allows applications to access data

stored in an RDBMS as well as inspect the nature of this data. Most database vendors

provide a JDBC driver that implements the JDBC API and contains the necessary code

to connect to their database server.

Before the import can start, Sqoop uses JDBC to examine the table it is to import. It

retrieves a list of all the columns and their SQL data types. These SQL types (VARCHAR,

INTEGER, and so on) can then be mapped to Java data types (String, Integer, etc.), which

will hold the field values in MapReduce applications. Sqoop’s code generator will use

this information to create a table-specific class to hold a record extracted from the table.

The Widget class from earlier, for example, contains the following methods that retrieve

each column from an extracted record:

public Integer get_id();

public String get_widget_name();

public java.math.BigDecimal get_price();

public java.sql.Date get_design_date();

public Integer get_version();

public String get_design_comment();

More critical to the import system’s operation, though, are the serialization methods

that form the DBWritable interface, which allow the Widget class to interact with JDBC:

public void readFields(ResultSet __dbResults) throws SQLException;

public void write(PreparedStatement __dbStmt) throws SQLException;

JDBC’s ResultSet interface provides a cursor that retrieves records from a query; the

readFields() method here will populate the fields of the Widget object with the columns

from one row of the ResultSet’s data. The write() method shown above allows Sqoop

to insert new Widget rows into a table, a process called exporting.
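The two directions described above (filling a record from a result row, and binding a record to an INSERT) can be illustrated with a small sketch. This is Python with the standard-library sqlite3 module, not Sqoop's generated Java; the Widget class here, its read_fields() and write() methods, and the widgets_copy table are hypothetical stand-ins chosen only to mirror readFields(ResultSet) and write(PreparedStatement).

```python
import sqlite3
from dataclasses import dataclass
from typing import Optional


@dataclass
class Widget:
    """Illustrative stand-in for a generated record class (not real Sqoop code)."""
    id: Optional[int] = None
    widget_name: Optional[str] = None
    price: Optional[float] = None

    def read_fields(self, row):
        # Analogous to readFields(ResultSet): copy one result row into this object.
        self.id, self.widget_name, self.price = row

    def write(self, cursor, table):
        # Analogous to write(PreparedStatement): bind this object's fields to an INSERT.
        cursor.execute(f"INSERT INTO {table} VALUES (?, ?, ?)",
                       (self.id, self.widget_name, self.price))


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE widgets (id INTEGER, widget_name TEXT, price REAL)")
conn.execute("CREATE TABLE widgets_copy (id INTEGER, widget_name TEXT, price REAL)")
conn.execute("INSERT INTO widgets VALUES (1, 'sprocket', 0.25)")

record = Widget()
rows = conn.execute("SELECT id, widget_name, price FROM widgets").fetchall()
for row in rows:
    record.read_fields(row)                      # the "import" direction
    record.write(conn.cursor(), "widgets_copy")  # the "export" direction

copied = conn.execute("SELECT * FROM widgets_copy").fetchall()
print(copied)
```

The same object is reused for every row, which is roughly how a record container is meant to be used: it holds one row at a time.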

The MapReduce job launched by Sqoop uses an InputFormat that can read sections of

a table from a database via JDBC. The DataDrivenDBInputFormat provided with Hadoop

partitions a query’s results over several map tasks.

Reading a table is typically done with a simple query such as:

SELECT col1,col2,col3,... FROM tableName

But often, better import performance can be gained by dividing this query across multiple

nodes. This is done using a splitting column. Using metadata about the table, Sqoop will guess a good column to use for splitting the table (typically the primary key for the table, if one exists). The minimum and maximum values for the primary key column are retrieved, and then these are used in conjunction with a target number of tasks to determine the queries that each map task should issue.
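To make the splitting idea concrete, here is a minimal Python sketch that divides a key range evenly among a target number of tasks and builds one bounded query per task. It is an illustration under simple assumptions (an integer key, an even division of the range), not Sqoop's actual splitting code; the function names split_ranges and task_queries are hypothetical.

```python
def split_ranges(lo, hi, num_tasks):
    """Divide the inclusive key range [lo, hi] into at most num_tasks contiguous ranges."""
    total = hi - lo + 1
    num_tasks = max(1, min(num_tasks, total))   # never more tasks than key values
    base, extra = divmod(total, num_tasks)
    ranges, start = [], lo
    for i in range(num_tasks):
        size = base + (1 if i < extra else 0)   # spread the remainder over the first ranges
        ranges.append((start, start + size - 1))
        start += size
    return ranges


def task_queries(table, columns, split_col, lo, hi, num_tasks):
    """Build the query each map task would issue for its slice of the key range."""
    cols = ", ".join(columns)
    return [
        f"SELECT {cols} FROM {table} WHERE {split_col} >= {a} AND {split_col} <= {b}"
        for a, b in split_ranges(lo, hi, num_tasks)
    ]


for query in task_queries("widgets", ["id", "widget_name"], "id", 0, 99999, 4):
    print(query)
```

An even division like this works well when key values are spread uniformly; if the keys are skewed, some tasks would receive far more rows than others.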

Controlling the import:

Sqoop does not need to import an entire table at a time. For example, a subset of the table’s columns can be specified for import. Users can also specify a WHERE clause to include in queries, which bounds the rows of the table to import. For example, if widgets 0 through 99,999 were imported last month, but this month our vendor catalog included 1,000 new types of widget, an import could be configured with the clause WHERE id >= 100000; this will start an import job retrieving all the new rows added to the source database since the previous import run. User-supplied WHERE clauses are applied before task splitting is performed, and are pushed down into the queries executed by each task.
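The two places where a user-supplied filter takes effect can be sketched as follows. This Python example uses sqlite3 in place of MySQL; the helper names bounds_with_filter and task_query are hypothetical, and the real queries Sqoop issues may be shaped differently. The filter is applied first when computing the key bounds, and again inside the query a task runs.

```python
import sqlite3


def bounds_with_filter(conn, table, split_col, user_where):
    """Compute the split column's min and max over only the rows the filter keeps."""
    sql = f"SELECT MIN({split_col}), MAX({split_col}) FROM {table} WHERE {user_where}"
    return conn.execute(sql).fetchone()


def task_query(table, split_col, lo, hi, user_where):
    """Build one task's query: the user's filter AND that task's key range."""
    return (f"SELECT * FROM {table} WHERE ({user_where}) "
            f"AND {split_col} >= {lo} AND {split_col} <= {hi}")


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE widgets (id INTEGER PRIMARY KEY, widget_name TEXT)")
conn.executemany("INSERT INTO widgets VALUES (?, ?)",
                 [(i, f"widget-{i}") for i in (5, 99999, 100000, 100500, 100999)])

user_where = "id >= 100000"            # the clause from the example in the text
lo, hi = bounds_with_filter(conn, "widgets", "id", user_where)
rows = conn.execute(task_query("widgets", "id", lo, hi, user_where)).fetchall()
print(lo, hi, len(rows))
```

Because the bounds are computed after filtering, the key range handed to the tasks starts at the first new widget rather than at 0.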

Imports and Consistency:

When importing data to HDFS, it is important that you ensure access to a consistent

snapshot of the source data. Map tasks reading from a database in parallel are running

in separate processes. Thus, they cannot share a single database transaction. The best

way to do this is to ensure that any processes that update existing rows of a table are

disabled during the import.

Direct-mode imports

Sqoop’s architecture allows it to choose from multiple available strategies for performing an import. Most databases will use the DataDrivenDBInputFormat-based approach described above. Some databases offer specific tools designed to extract data quickly. For example, MySQL’s mysqldump application can read from a table with greater throughput than a JDBC channel. The use of these external tools is referred to as direct mode in Sqoop’s documentation. Direct mode must be specifically enabled by the user (via the --direct argument), as it is not as general-purpose as the JDBC approach. (For example, MySQL’s direct mode cannot handle large objects, that is, CLOB or BLOB columns, as Sqoop needs to use a JDBC-specific API to load these columns into HDFS.)

For databases that provide such tools, Sqoop can use these to great effect. A direct-mode import from MySQL is usually much more efficient (in terms of map tasks and time required) than a comparable JDBC-based import. Sqoop will still launch multiple map tasks in parallel. These tasks will then spawn instances of the mysqldump program and read its output. The effect is similar to a distributed implementation of mk-parallel-dump from the Maatkit tool set. Sqoop can also perform direct-mode imports from PostgreSQL.

Working with imported data:

Once data has been imported to HDFS, it is now ready for processing by custom MapReduce

programs.

Text-based imports can be easily used in scripts run with Hadoop Streaming or in MapReduce jobs run with the default TextInputFormat.

To use individual fields of an imported record, though, the field delimiters (and any

escape/enclosing characters) must be parsed and the field values extracted and converted

to the appropriate data types. For example, the id of the “sprocket” widget is

represented as the string "1" in the text file, but should be parsed into an Integer or

int variable in Java. The generated table class provided by Sqoop can automate this

process, allowing you to focus on the actual MapReduce job to run. Each autogenerated

class has several overloaded methods named parse() that operate on the data represented as Text, CharSequence, char[], or other common types.
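As an illustration of what such parsing involves, the following Python sketch converts one line of the imported text file into typed fields. It is not the generated Java class: the Widget dataclass and the parse() function are hypothetical stand-ins, and the sketch assumes plain comma delimiters with no escaping or enclosing characters, and that NULL appears as the literal string null as in the sample output.

```python
from dataclasses import dataclass
from datetime import date
from decimal import Decimal
from typing import Optional


@dataclass
class Widget:
    id: int
    widget_name: str
    price: Optional[Decimal]
    design_date: Optional[date]
    version: Optional[int]
    design_comment: Optional[str]


def _nullable(text, convert):
    # The sample import output writes SQL NULL as the literal string "null".
    return None if text == "null" else convert(text)


def parse(line: str) -> Widget:
    """Parse one comma-delimited line. Assumes no escaped or enclosed commas."""
    f = line.rstrip("\n").split(",")
    return Widget(
        id=int(f[0]),
        widget_name=f[1],
        price=_nullable(f[2], Decimal),
        design_date=_nullable(f[3], date.fromisoformat),
        version=_nullable(f[4], int),
        design_comment=_nullable(f[5], str),
    )


print(parse("1,sprocket,0.25,2010-02-10,1,Connects two gizmos"))
```

A comment field containing a comma would break this simple split, which is why real imports need the escape and enclosing characters mentioned above.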

The MapReduce application called MaxWidgetId (available in the example code) will

find the widget with the highest ID.

The class can be compiled into a JAR file along with Widget.java. Both Hadoop (hadoop-core-version.jar) and Sqoop (sqoop-version.jar) will need to be on the classpath for compilation. The class files can then be combined into a JAR file and executed like so:

% jar cvvf widgets.jar *.class

% HADOOP_CLASSPATH=/usr/lib/sqoop/sqoop-version.jar hadoop jar \

> widgets.jar MaxWidgetId -libjars /usr/lib/sqoop/sqoop-version.jar

This command line ensures that Sqoop is on the classpath locally (via $HADOOP_CLASSPATH) when running the MaxWidgetId.run() method, as well as when map tasks are running on the cluster (via the -libjars argument).

When run, the maxwidgets path in HDFS will contain a file named part-r-00000 with

the following expected result:

3,gadget,99.99,1983-08-13,13,Our flagship product
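The MaxWidgetId source is not reproduced in these notes, so the following is only a guess at its shape: a local Python simulation of a map phase that sends every record to a single key and a reduce phase that keeps the record with the largest id. The function names and the single shared key are assumptions made for illustration.

```python
from functools import reduce

LINES = [
    "1,sprocket,0.25,2010-02-10,1,Connects two gizmos",
    "2,gizmo,4.00,2009-11-30,4,null",
    "3,gadget,99.99,1983-08-13,13,Our flagship product",
]


def widget_id(line):
    # The id is the first comma-delimited field of each imported record.
    return int(line.split(",", 1)[0])


def map_phase(lines):
    # Emit every record under one shared key so a single reducer sees them all.
    return [("max", line) for line in lines]


def reduce_phase(pairs):
    # Keep the record with the largest id.
    return reduce(
        lambda best, cur: cur if widget_id(cur) > widget_id(best) else best,
        (line for _, line in pairs),
    )


result = reduce_phase(map_phase(LINES))
print(result)
```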

It is worth noting that in this example MapReduce program, a Widget object was

emitted from the mapper to the reducer; the auto-generated Widget class implements

the Writable interface provided by Hadoop, which allows the object to be sent via

Hadoop’s serialization mechanism, as well as written to and read from SequenceFiles.

The MaxWidgetId example is built on the new MapReduce API. MapReduce applications

that rely on Sqoop-generated code can be built on the new or old APIs, though some

advanced features (such as working with large objects) are more convenient to use in

the new API.

Imported data and Hive:

For many types of analysis, using a system like Hive to handle

relational operations can dramatically ease the development of the analytic pipeline.

Especially for data originally from a relational data source, using Hive makes a lot of

sense. Hive and Sqoop together form a powerful toolchain for performing analysis.

Suppose we had another log of data in our system, coming from a web-based widget

purchasing system. This may return log files containing a widget id, a quantity, a shipping

address, and an order date.

Here is a snippet from an example log of this type:

1,15,120 Any St.,Los Angeles,CA,90210,2010-08-01

3,4,120 Any St.,Los Angeles,CA,90210,2010-08-01

2,5,400 Some Pl.,Cupertino,CA,95014,2010-07-30

2,7,88 Mile Rd.,Manhattan,NY,10005,2010-07-18

By using Hadoop to analyze this purchase log, we can gain insight into our sales operation.

By combining this data with the data extracted from our relational data source (the widgets table), we can do better. In this example session, we will compute which zip code is responsible for the most sales dollars, so we can better focus our sales team’s operations.

Doing this requires data from both the sales log and the widgets table.

The log lines shown above should be saved in a local file named sales.log for this to work.

First, let’s load the sales data into Hive:

hive> CREATE TABLE sales(widget_id INT, qty INT,

> street STRING, city STRING, state STRING,

> zip INT, sale_date STRING)

> ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';

Time taken: 5.248 seconds

hive> LOAD DATA LOCAL INPATH "sales.log" INTO TABLE sales;

Copying data from file:/home/sales.log

Loading data to table sales

Time taken: 0.188 seconds

Sqoop can generate a Hive table based on a table from an existing relational data source.

Since we’ve already imported the widgets data to HDFS, we can generate the Hive table

definition and then load in the HDFS-resident data:

% sqoop create-hive-table --connect jdbc:mysql://localhost/hadoopguide \

> --table widgets --fields-terminated-by ','

...

10/06/23 18:05:34 INFO hive.HiveImport: OK

10/06/23 18:05:34 INFO hive.HiveImport: Time taken: 3.22 seconds

10/06/23 18:05:35 INFO hive.HiveImport: Hive import complete.

% hive

hive> LOAD DATA INPATH "widgets" INTO TABLE widgets;

Loading data to table widgets

Time taken: 3.265 seconds

When creating a Hive table definition with a specific already-imported dataset in mind,

we need to specify the delimiters used in that dataset. Otherwise, Sqoop will allow Hive

to use its default delimiters (which are different from Sqoop’s default delimiters).

This three-step process of importing data to HDFS, creating the Hive table, and then

loading the HDFS-resident data into Hive can be shortened to one step if you know

that you want to import straight from a database directly into Hive. During an import,

Sqoop can generate the Hive table definition and then load in the data. Had we not

already performed the import, we could have executed this command, which re-creates

the widgets table in Hive, based on the copy in MySQL:

% sqoop import --connect jdbc:mysql://localhost/hadoopguide \

> --table widgets -m 1 --hive-import

Regardless of which data import route we chose, we can now use the widgets data set

and the sales data set together to calculate the most profitable zip code. Let’s do so,

and also save the result of this query in another table for later:

hive> CREATE TABLE zip_profits (sales_vol DOUBLE, zip INT);

hive> INSERT OVERWRITE TABLE zip_profits

> SELECT SUM(w.price * s.qty) AS sales_vol, s.zip FROM SALES s

> JOIN widgets w ON (s.widget_id = w.id) GROUP BY s.zip;

...

3 Rows loaded to zip_profits

hive> SELECT * FROM zip_profits ORDER BY sales_vol DESC;

...

403.71 90210

28.0 10005

20.0 95014
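The figures in this result can be checked by hand. The short Python sketch below repeats the join and aggregation locally, using the three widget prices and the four sales log lines from the text; it is a sanity check on the arithmetic, not a replacement for the Hive query, and the function name sales_by_zip is hypothetical.

```python
from collections import defaultdict
from decimal import Decimal

# Prices from the widgets table, keyed by widget id.
PRICES = {1: Decimal("0.25"), 2: Decimal("4.00"), 3: Decimal("99.99")}

SALES_LOG = """\
1,15,120 Any St.,Los Angeles,CA,90210,2010-08-01
3,4,120 Any St.,Los Angeles,CA,90210,2010-08-01
2,5,400 Some Pl.,Cupertino,CA,95014,2010-07-30
2,7,88 Mile Rd.,Manhattan,NY,10005,2010-07-18
"""


def sales_by_zip(log_text, prices):
    """Sum price * quantity per zip code, largest total first."""
    totals = defaultdict(Decimal)
    for line in log_text.splitlines():
        widget_id, qty, _street, _city, _state, zip_code, _date = line.split(",")
        totals[int(zip_code)] += prices[int(widget_id)] * int(qty)
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)


ranking = sales_by_zip(SALES_LOG, PRICES)
for zip_code, volume in ranking:
    print(volume, zip_code)
```

For zip code 90210, the total is 15 × 0.25 + 4 × 99.99 = 3.75 + 399.96 = 403.71, which matches the first row of the Hive output.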

Importing Large Objects:

Most databases provide the capability to store large amounts of data in a single field. Depending on whether this data is textual or binary in nature, it is usually represented as a CLOB or BLOB column in the table. These “large objects” are often handled specially by the database itself. In particular, most tables are physically laid out on disk as an array of rows, with all the columns of a row stored next to one another. Scanning through rows to determine which ones match the criteria for a particular query therefore typically involves reading all columns of each row from disk. If large objects were stored “inline” in this fashion, they would adversely affect the performance of such scans. Therefore, large objects are often stored externally from their rows. Accessing a large object often requires “opening” it through the reference contained in the row.

The difficulty of working with large objects in a database suggests that a system such

as Hadoop, which is much better suited to storing and processing large, complex data

objects, is an ideal repository for such information. Sqoop can extract large objects

from tables and store them in HDFS for further processing.

As in a database, MapReduce typically materializes every record before passing it along

to the mapper. If individual records are truly large, this can be very inefficient.

As shown earlier, records imported by Sqoop are laid out on disk in a fashion very similar to a database’s internal structure: an array of records with all fields of a record concatenated together. When running a MapReduce program over imported records, each map task must fully materialize all fields of each record in its input split. If the contents of a large object field are only relevant for a small subset of the total number of records used as input to a MapReduce program, it would be inefficient to fully materialize all these records. Furthermore, depending on the size of the large object, full materialization in memory may be impossible.

To overcome these difficulties, Sqoop will store imported large objects in a separate file called a LobFile. The LobFile format can store individual records of very large size (a 64-bit address space is used). Each record in a LobFile holds a single large object. The LobFile format allows clients to hold a reference to a record without accessing the record contents. When records are accessed, this is done through a java.io.InputStream (for binary objects) or java.io.Reader (for character-based objects).

When a record is imported, the “normal” fields will be materialized together in a text

file, along with a reference to the LobFile where a CLOB or BLOB column is stored.

For example, suppose our widgets table contained a BLOB field named schematic

holding the actual schematic diagram for each widget.

An imported record might then look like:

2,gizmo,4.00,2009-11-30,4,null,externalLob(lf,lobfile0,100,5011714)

The externalLob(...) text is a reference to an externally stored large object, stored in

LobFile format (lf) in a file named lobfile0, with the specified byte offset and length

inside that file.
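To show how such a reference breaks down into its parts, here is a small Python sketch that parses the externalLob(...) text from the sample record. The field layout (format, file name, offset, length) is taken only from the description above; the LobRef type and parse_lob_ref function are hypothetical, and this does not read the LobFile itself.

```python
import re
from typing import NamedTuple, Optional


class LobRef(NamedTuple):
    fmt: str        # storage format, "lf" for LobFile in the example
    filename: str   # file holding the large object
    offset: int     # byte offset of the record inside that file
    length: int     # length of the record in bytes


_LOB_RE = re.compile(r"^externalLob\((\w+),([^,()]+),(\d+),(\d+)\)$")


def parse_lob_ref(field: str) -> Optional[LobRef]:
    """Return a LobRef for an externalLob(...) field, or None for anything else."""
    m = _LOB_RE.match(field)
    if m is None:
        return None
    fmt, filename, offset, length = m.groups()
    return LobRef(fmt, filename, int(offset), int(length))


line = "2,gizmo,4.00,2009-11-30,4,null,externalLob(lf,lobfile0,100,5011714)"
# Split off the six ordinary fields; the remainder is the reference text.
ref = parse_lob_ref(line.split(",", 6)[6])
print(ref)
```

Holding only this small tuple, rather than the roughly 5 MB object it points to, is what lets a job skip the large field for records that do not need it.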

When working with this record, the Widget.get_schematic() method would return an object of type BlobRef referencing the schematic column, but not actually containing its contents. The BlobRef.getDataStream() method actually opens the LobFile and returns an InputStream, allowing you to access the schematic field’s contents.

When running a MapReduce job processing many Widget records, you might need to

access the schematic field of only a handful of records. This system allows you to incur

the I/O costs of accessing only the required large object entries, as individual schematics

may be several megabytes or more of data.

The BlobRef and ClobRef classes cache references to underlying LobFiles within a map

task. If you do access the schematic field of several sequentially ordered records, they

will take advantage of the existing file pointer’s alignment on the next record body.

Performing an export:

In Sqoop, an import refers to the movement of data from a database system into HDFS.

By contrast, an export uses HDFS as the source of data and a remote database as the

destination. In the previous sections, we imported some data and then performed some

analysis using Hive. We can export the results of this analysis to a database for consumption

by other tools.

Before exporting a table from HDFS to a database, we must prepare the database to

receive the data by creating the target table. While Sqoop can infer which Java types

are appropriate to hold SQL data types, this translation does not work in both directions

(for example, there are several possible SQL column definitions that can hold data in

a Java String; this could be CHAR(64), VARCHAR(200), or something else entirely). Consequently,

you must determine which types are most appropriate.

We are going to export the zip_profits table from Hive. We need to create a table in

MySQL that has target columns in the same order, with the appropriate SQL types:

% mysql hadoopguide

mysql> CREATE TABLE sales_by_zip (volume DECIMAL(8,2), zip INTEGER);

Query OK, 0 rows affected (0.01 sec)

Then we run the export command:

% sqoop export --connect jdbc:mysql://localhost/hadoopguide -m 1 \

> --table sales_by_zip --export-dir /user/hive/warehouse/zip_profits \

> --input-fields-terminated-by '\0001'

...

10/07/02 16:16:50 INFO mapreduce.ExportJobBase: Transferred 41 bytes in 10.8947 seconds (3.7633 bytes/sec)

10/07/02 16:16:50 INFO mapreduce.ExportJobBase: Exported 3 records.

Finally, we can verify that the export worked by checking MySQL:

% mysql hadoopguide -e 'SELECT * FROM sales_by_zip'

+--------+-------+
| volume | zip   |
+--------+-------+
|  28.00 | 10005 |
| 403.71 | 90210 |
|  20.00 | 95014 |
+--------+-------+

When we created the zip_profits table in Hive, we did not specify any delimiters. So Hive used its default delimiters: a Ctrl-A character (Unicode 0x0001) between fields, and a newline at the end of each record. When we used Hive to access the contents of this table (in a SELECT statement), Hive converted this to a tab-delimited representation for display on the console. But when reading the tables directly from files, we need to tell Sqoop which delimiters to use. Sqoop assumes records are newline-delimited by default, but needs to be told about the Ctrl-A field delimiters. The --input-fields-terminated-by argument to sqoop export specified this information. Sqoop supports several escape sequences (which start with a '\' character) when specifying delimiters. In the example syntax above, the escape sequence is enclosed in 'single quotes' to ensure that the shell processes it literally. Without the quotes, the leading backslash itself may need to be escaped (for example, --input-fields-terminated-by \\0001).

The escape sequences supported by Sqoop are listed in Table 15-1.
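To see what the Ctrl-A delimiter looks like in practice, the following Python sketch parses one record in the layout described above: two fields separated by the character 0x01 and terminated by a newline. The sample line is constructed by hand from the zip_profits result rather than read from the Hive warehouse, and parse_hive_line is a hypothetical helper.

```python
from decimal import Decimal

FIELD_DELIM = "\x01"   # Ctrl-A, the default field delimiter described above


def parse_hive_line(line):
    """Split one Ctrl-A-delimited zip_profits record into (sales_vol, zip)."""
    sales_vol, zip_code = line.rstrip("\n").split(FIELD_DELIM)
    return Decimal(sales_vol), int(zip_code)


sample = "403.71\x0190210\n"   # hand-built example record
print(parse_hive_line(sample))
```

The delimiter is invisible when the file is printed to a terminal, which is why the records can look like one run-together field until the reader is told what to split on.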

Exports: Deeper look

The architecture of Sqoop’s export capability is very similar in nature to how Sqoop

performs imports. Before performing the export, Sqoop picks a strategy

based on the database connect string. For most systems, Sqoop uses JDBC. Sqoop then generates a Java class based on the target table definition. This generated class has the ability to parse records from text files and insert values of the appropriate types into a table (in addition to the ability to read the columns from a ResultSet).

A MapReduce job is then launched that reads the source data files from HDFS, parses the records

using the generated class, and executes the chosen export strategy.

The JDBC-based export strategy builds up batch INSERT statements that will each add

multiple records to the target table. Inserting many records per statement performs

much better than executing many single-row INSERT statements on most database systems.

Separate threads are used to read from HDFS and communicate with the database, to ensure that I/O operations involving different systems are overlapped as much as possible. For MySQL, Sqoop can employ a direct-mode strategy using mysqlimport. Each map task spawns a mysqlimport process that it communicates with via a named FIFO on the local filesystem. Data is then streamed into mysqlimport via the FIFO channel, and from there into the database.
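The batching idea can be sketched in a few lines. This Python example uses sqlite3 in place of MySQL and builds multi-row INSERT statements with placeholders; it illustrates the technique only and is not Sqoop's export code. The helper names and the batch size of 100 are assumptions.

```python
import sqlite3


def batched(rows, size):
    """Yield successive slices of at most `size` rows."""
    for i in range(0, len(rows), size):
        yield rows[i:i + size]


def multi_row_insert(table, num_columns, num_rows):
    """Build one INSERT statement with a placeholder group per row."""
    one_row = "(" + ", ".join("?" * num_columns) + ")"
    return f"INSERT INTO {table} VALUES " + ", ".join([one_row] * num_rows)


def export_rows(conn, table, rows, batch_size=100):
    """Insert rows in batches; return how many INSERT statements were executed."""
    statements = 0
    for batch in batched(rows, batch_size):
        sql = multi_row_insert(table, len(batch[0]), len(batch))
        params = [value for row in batch for value in row]   # flatten for binding
        conn.execute(sql, params)
        statements += 1
    conn.commit()
    return statements


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales_by_zip (volume REAL, zip INTEGER)")
demo_rows = [(28.0, 10005), (403.71, 90210), (20.0, 95014)]
print(export_rows(conn, "sales_by_zip", demo_rows))
```

With a batch size of 100, exporting 250 rows takes three statements instead of 250, which is where the round-trip savings come from.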

While most MapReduce jobs reading from HDFS pick the degree of parallelism (number

of map tasks) based on the number and size of the files to process, Sqoop’s export system allows users explicit control over the number of tasks. The performance of the export can be affected by the number of parallel writers to the database, so Sqoop uses the CombineFileInputFormat class to group up the input files into a smaller number of map tasks.

Exports and Transactionality:

Due to the parallel nature of the process, an export is often not an atomic operation.

Sqoop will spawn multiple tasks to export slices of the data in parallel. These tasks can

complete at different times, meaning that even though transactions are used inside

tasks, results from one task may be visible before the results of another task. Moreover,

databases often use fixed-size buffers to store transactions. As a result, one transaction

cannot necessarily contain the entire set of operations performed by a task. Sqoop

commits results every few thousand rows, to ensure that it does not run out of memory.

These intermediate results are visible while the export continues. Applications that will

use the results of an export should not be started until the export process is complete,

or they may see partial results.

To solve this problem, Sqoop can export to a temporary staging table, then at the end

of the job—if the export has succeeded—move the staged data into the destination

table in a single transaction. You can specify a staging table with the --staging-table

option. The staging table must already exist and have the same schema as the destination.

It must also be empty, unless the --clear-staging-table

option is also supplied.
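The staging pattern can be sketched as follows, again in Python with sqlite3 standing in for the target database. Rows are committed to the staging table in small chunks, and only after all of them have arrived are they moved to the destination in a single transaction. This is an illustration of the idea, not Sqoop's implementation; the function name, its parameters, and the two-column table layout are assumptions.

```python
import sqlite3


def export_via_staging(conn, rows, staging, dest, commit_every=2, clear_staging=False):
    """Load rows into `staging` with intermediate commits, then move them to `dest` atomically."""
    if clear_staging:
        conn.execute(f"DELETE FROM {staging}")
        conn.commit()
    elif conn.execute(f"SELECT COUNT(*) FROM {staging}").fetchone()[0] != 0:
        raise ValueError("staging table must be empty")

    for n, row in enumerate(rows, start=1):
        conn.execute(f"INSERT INTO {staging} VALUES (?, ?)", row)
        if n % commit_every == 0:
            conn.commit()   # partial results are visible, but only in the staging table
    conn.commit()

    # Move everything into the destination in one transaction:
    # readers of `dest` see either none of the rows or all of them.
    with conn:
        conn.execute(f"INSERT INTO {dest} SELECT * FROM {staging}")
        conn.execute(f"DELETE FROM {staging}")
```

If the final move fails, the `with conn:` block rolls the transaction back, so the destination table is left as it was before the export.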

Exports and SequenceFiles:

The example export read source data from a Hive table, which is stored in HDFS as a

delimited text file. Sqoop can also export delimited text files that were not Hive tables.

For example, it can export text files that are the output of a MapReduce job.

Sqoop can also export records stored in SequenceFiles to an output table, although some restrictions apply. A SequenceFile can contain arbitrary record types. Sqoop’s export tool will read objects from SequenceFiles and send them directly to the OutputCollector, which passes the objects to the database export OutputFormat. To work with Sqoop, the record must be stored in the “value” portion of the SequenceFile’s key-value pair format and must subclass the org.apache.sqoop.lib.SqoopRecord abstract class (as is done by all classes generated by Sqoop).

If you use the codegen tool (sqoop-codegen) to generate a SqoopRecord implementation for a record based on your export target table, you can then write a MapReduce program that populates instances of this class and writes them to SequenceFiles. sqoop-export can then export these SequenceFiles to the table. Another way that data may end up in SqoopRecord instances in SequenceFiles is if it is imported from a database table to HDFS, modified in some fashion, and the results stored in SequenceFiles holding records of the same data type.

In this case, Sqoop should reuse the existing class definition to read data from SequenceFiles, rather than generate a new (temporary) record container class to perform the export, as is done when converting text-based records to database rows. You can suppress code generation and instead use an existing record class and jar by providing the --class-name and --jar-file arguments to Sqoop. Sqoop will use the specified class, loaded from the specified jar, when exporting records.

In the following example, we will re-import the widgets table as SequenceFiles, and

then export it back to the database in a different table:

% sqoop import --connect jdbc:mysql://localhost/hadoopguide \

> --table widgets -m 1 --class-name WidgetHolder --as-sequencefile \

> --target-dir widget_sequence_files --bindir .

...

10/07/05 17:09:13 INFO mapreduce.ImportJobBase: Retrieved 3 records.

% mysql hadoopguide

mysql> CREATE TABLE widgets2(id INT, widget_name VARCHAR(100),

-> price DOUBLE, designed DATE, version INT, notes VARCHAR(200));

Query OK, 0 rows affected (0.03 sec)

mysql> exit;

% sqoop export --connect jdbc:mysql://localhost/hadoopguide \

> --table widgets2 -m 1 --class-name WidgetHolder \

> --jar-file widgets.jar --export-dir widget_sequence_files

...

10/07/05 17:26:44 INFO mapreduce.ExportJobBase: Exported 3 records.

During the import, we specified the SequenceFile format, and that we wanted the jar

file to be placed in the current directory (with --bindir), so we can reuse it. Otherwise,

it would be placed in a temporary directory. We then created a destination table for

the export, which had a slightly different schema, albeit one that is compatible with

the original data. We then ran an export that used the existing generated code to read

the records from the SequenceFile and write them to the database.

Friday, May 30, 2014

ZooKeeper

ZooKeeper:

- ZooKeeper is Hadoop’s distributed coordination service, used for building distributed applications.
- When a message is sent between two nodes and the network fails, the sender does not know whether the receiver got the message.
- It may have gotten through before the network failed, or it may not have, or perhaps the receiver’s process died.
- The only way the sender can find out what happened is to reconnect to the receiver and ask it.
- This is partial failure: when we don’t even know whether an operation failed.

- ZooKeeper gives you a set of tools to build distributed applications that can safely handle partial failures.

ZooKeeper has the following characteristics:

- It is simple: at its core it is a stripped-down filesystem that exposes a few simple operations, plus some extra abstractions such as ordering and notifications.
- It is expressive: its primitives can be used to build a large class of coordination data structures and protocols. Examples include distributed queues, distributed locks, and leader election among a group of peers.
- It is highly available: it runs on a collection of machines and is designed to be highly available, so applications can depend on it. It helps you avoid introducing single points of failure into your system, so you can build a reliable application.
- It facilitates loosely coupled interactions: participants do not need to know about one another.
- It is a library: it provides an open source, shared repository of implementations and recipes of common coordination patterns.
- It is high performance: throughput for a ZooKeeper cluster has been benchmarked at over 10,000 operations per second.

Installing and Running ZooKeeper:

- When trying out ZooKeeper for the first time, it’s simplest to run it in standalone mode
with a single ZooKeeper server. You can do this on a development machine, for example.
ZooKeeper requires Java 6 to run, so make sure you have it installed first.
You don’t need Cygwin to run ZooKeeper on Windows, since there are Windows versions of the ZooKeeper scripts.
(Windows is supported only as a development platform, not as a
production platform.)

Download a stable release of ZooKeeper from the Apache ZooKeeper releases page at
http://zookeeper.apache.org/releases.html, and unpack the tarball in a suitable location:
% tar xzf zookeeper-x.y.z.tar.gz

ZooKeeper provides a few binaries to run and interact with the service, and it’s convenient
to put the directory containing the binaries on your command-line path:
% export ZOOKEEPER_INSTALL=/home/tom/zookeeper-x.y.z
% export PATH=$PATH:$ZOOKEEPER_INSTALL/bin

Before running the ZooKeeper service, we need to set up a configuration file. The configuration
file is conventionally called zoo.cfg and placed in the conf subdirectory (although you can also place it in /etc/zookeeper, or in the directory defined by the ZOOCFGDIR environment variable, if set). Here’s an example:
tickTime=2000
dataDir=/Users/tom/zookeeper
clientPort=2181

This is a standard Java properties file, and the three properties defined in this example
are the minimum required for running ZooKeeper in standalone mode. Briefly,
tickTime is the basic time unit in ZooKeeper (specified in milliseconds), dataDir is the
local filesystem location where ZooKeeper stores persistent data, and clientPort is the
port that ZooKeeper listens on for client connections (2181 is a common choice). You
should change dataDir to an appropriate setting for your system.
With a suitable configuration defined, we are now ready to start a local ZooKeeper
server:
% zkServer.sh start
To check whether ZooKeeper is running, send the ruok command (“Are you OK?”) to
the client port using nc (telnet works, too):
% echo ruok | nc localhost 2181
imok

Example:

- Imagine a group of servers that provide some service to clients.

- We want clients to be able to locate one of the servers, so they can use the service.

- One of the challenges is to maintain the list of servers in the group.

Group Membership in ZooKeeper:

- One way of understanding ZooKeeper is to think of it as providing a highly available filesystem.

- It does not have files and directories, but a unified concept of a node, called a znode, which acts both as a container of data (like a file) and a container of other znodes (like a directory).

- Znodes form a hierarchical namespace, and a natural way to build a membership list is to create a parent znode with the name of the group and child znodes with the names of the group members (servers).

Creating a Group:

Example: A program to create a znode representing a group in ZooKeeper

import java.io.IOException;
import java.util.concurrent.CountDownLatch;

import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.WatchedEvent;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.Watcher.Event.KeeperState;
import org.apache.zookeeper.ZooDefs.Ids;
import org.apache.zookeeper.ZooKeeper;

public class CreateGroup implements Watcher {

  private static final int SESSION_TIMEOUT = 5000;

  private ZooKeeper zk;
  private CountDownLatch connectedSignal = new CountDownLatch(1);

  public void connect(String hosts) throws IOException, InterruptedException {
    zk = new ZooKeeper(hosts, SESSION_TIMEOUT, this);
    connectedSignal.await(); // block until process() sees SyncConnected
  }

  @Override
  public void process(WatchedEvent event) { // Watcher interface
    if (event.getState() == KeeperState.SyncConnected) {
      connectedSignal.countDown();
    }
  }

  public void create(String groupName) throws KeeperException,
      InterruptedException {
    String path = "/" + groupName;
    String createdPath = zk.create(path, null/*data*/, Ids.OPEN_ACL_UNSAFE,
        CreateMode.PERSISTENT);
    System.out.println("Created " + createdPath);
  }

  public void close() throws InterruptedException {
    zk.close();
  }

  public static void main(String[] args) throws Exception {
    CreateGroup createGroup = new CreateGroup();
    createGroup.connect(args[0]);
    createGroup.create(args[1]);
    createGroup.close();
  }
}

When the main() method is run, it creates a CreateGroup instance and then calls its

connect() method. This method instantiates a new ZooKeeper object, the main class of

the client API and the one that maintains the connection between the client and the

ZooKeeper service. The constructor takes three arguments: the first is the host address

(and optional port, which defaults to 2181) of the ZooKeeper service;

the second is the session timeout in milliseconds (which we set to 5 seconds), explained in more

detail later; and the third is an instance of a Watcher object. The Watcher object receives

callbacks from ZooKeeper to inform it of various events. In this case, CreateGroup is a

Watcher, so we pass this to the ZooKeeper constructor.

When a ZooKeeper instance is created, it starts a thread to connect to the ZooKeeper

service. The call to the constructor returns immediately, so it is important to wait for

the connection to be established before using the ZooKeeper object. We make use of

Java’s CountDownLatch class (in the java.util.concurrent package) to block until the

ZooKeeper instance is ready. This is where the Watcher comes in. The Watcher interface

has a single method:

public void process(WatchedEvent event);

When the client has connected to ZooKeeper, the Watcher receives a call to its

process() method with an event indicating that it has connected. On receiving a connection

event (represented by the Watcher.Event.KeeperState enum, with value

SyncConnected), we decrement the counter in the CountDownLatch, using its countDown()

method. The latch was created with a count of one, representing the number of

events that need to occur before it releases all waiting threads. After calling countDown()

once, the counter reaches zero and the await() method returns.
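
The latch handshake described above can be tried without a ZooKeeper server. The following is a minimal sketch in plain Java, in which a background thread stands in for the ZooKeeper client's event thread. The class name LatchDemo and its methods are illustrative names, not part of the ZooKeeper API, and the timeout on await() is an addition that the listing above does not use.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

// Hypothetical stand-in for CreateGroup.connect(): no ZooKeeper involved.
class LatchDemo {

  private final CountDownLatch connectedSignal = new CountDownLatch(1);

  // Simulates the callback that ZooKeeper would deliver on its event thread.
  void onConnected() {
    connectedSignal.countDown(); // counter goes from 1 to 0
  }

  // Starts the simulated connection in the background, then blocks until the
  // latch is released or the timeout elapses. Returns true if it was released.
  boolean connect(long timeoutMillis) {
    new Thread(this::onConnected).start();
    try {
      return connectedSignal.await(timeoutMillis, TimeUnit.MILLISECONDS);
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt(); // preserve the interrupt status
      return false;
    }
  }

  public static void main(String[] args) {
    LatchDemo demo = new LatchDemo();
    System.out.println("connected = " + demo.connect(5000));
  }
}
```

The calling thread cannot proceed past await() until the callback has run, which is exactly why connect() in CreateGroup is safe to follow with create().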

The connect() method has now returned, and the next method to be invoked on the
CreateGroup is the create() method. In this method, we create a new ZooKeeper znode
using the create() method on the ZooKeeper instance. The arguments it takes are the
path (represented by a string), the contents of the znode (a byte array, null here), an
access control list (or ACL for short, which here is a completely open ACL, allowing
any client to read or write the znode), and the nature of the znode to be created.

Znodes may be ephemeral or persistent. An ephemeral znode will be deleted by the
ZooKeeper service when the client that created it disconnects, either by explicitly disconnecting
or if the client terminates for whatever reason. A persistent znode, on the other hand, is not deleted when the client disconnects. We want the znode representing a group to live longer than the lifetime of the program that creates it, so we create a persistent znode.
The return value of the create() method is the path that was created by ZooKeeper.
We use it to print a message that the path was successfully created. We will see how
the path returned by create() may differ from the one passed into the method when
we look at sequential znodes.
To see the program in action, we need to have ZooKeeper running on the local machine,
and then we can type:
% export CLASSPATH=ch14/target/classes/:$ZOOKEEPER_INSTALL/*:$ZOOKEEPER_INSTALL/lib/*:\
$ZOOKEEPER_INSTALL/conf
% java CreateGroup localhost zoo
Created /zoo

Joining a Group:

- The next part of the application is a program to register a member in a group. Each
member will run as a program and join a group. When the program exits, it should be
removed from the group, which we can do by creating an ephemeral znode that represents
it in the ZooKeeper namespace.
The JoinGroup program implements this idea, and its listing follows. The
logic for creating and connecting to a ZooKeeper instance has been refactored into a base
class, ConnectionWatcher, which is listed after it.

Example: A program that joins a group
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.ZooDefs.Ids;

public class JoinGroup extends ConnectionWatcher {

  public void join(String groupName, String memberName) throws KeeperException,
      InterruptedException {
    String path = "/" + groupName + "/" + memberName;
    // EPHEMERAL: the znode is removed when this client's session ends
    String createdPath = zk.create(path, null/*data*/, Ids.OPEN_ACL_UNSAFE,
        CreateMode.EPHEMERAL);
    System.out.println("Created " + createdPath);
  }

  public static void main(String[] args) throws Exception {
    JoinGroup joinGroup = new JoinGroup();
    joinGroup.connect(args[0]);
    joinGroup.join(args[1], args[2]);

    // stay alive until process is killed or thread is interrupted
    Thread.sleep(Long.MAX_VALUE);
  }
}

Example: A helper class that waits for the connection to ZooKeeper to be established
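import java.io.IOException;
import java.util.concurrent.CountDownLatch;

import org.apache.zookeeper.WatchedEvent;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.Watcher.Event.KeeperState;
import org.apache.zookeeper.ZooKeeper;

public class ConnectionWatcher implements Watcher {

  private static final int SESSION_TIMEOUT = 5000;

  protected ZooKeeper zk;
  private CountDownLatch connectedSignal = new CountDownLatch(1);

  public void connect(String hosts) throws IOException, InterruptedException {
    zk = new ZooKeeper(hosts, SESSION_TIMEOUT, this);
    connectedSignal.await(); // block until the session is connected
  }

  @Override
  public void process(WatchedEvent event) {
    if (event.getState() == KeeperState.SyncConnected) {
      connectedSignal.countDown();
    }
  }

  public void close() throws InterruptedException {
    zk.close();
  }
}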

The code for JoinGroup is very similar to CreateGroup. It creates an ephemeral znode as
a child of the group znode in its join() method, then simulates doing work of some
kind by sleeping until the process is forcibly terminated. Later, you will see that upon
termination, the ephemeral znode is removed by ZooKeeper.

Listing Members in a Group:

- Example: A program to list the members in a group
import java.util.List;

import org.apache.zookeeper.KeeperException;

public class ListGroup extends ConnectionWatcher {

  public void list(String groupName) throws KeeperException,
      InterruptedException {
    String path = "/" + groupName;

    try {
      // false: do not set a watch on the group's children
      List<String> children = zk.getChildren(path, false);
      if (children.isEmpty()) {
        System.out.printf("No members in group %s\n", groupName);
        System.exit(1);
      }
      for (String child : children) {
        System.out.println(child);
      }
    } catch (KeeperException.NoNodeException e) {
      System.out.printf("Group %s does not exist\n", groupName);
      System.exit(1);
    }
  }

  public static void main(String[] args) throws Exception {
    ListGroup listGroup = new ListGroup();
    listGroup.connect(args[0]);
    listGroup.list(args[1]);
    listGroup.close();
  }
}

In the list() method, we call getChildren() with a znode path and a watch flag to
retrieve a list of child paths for the znode, which we print out. Placing a watch on a
znode causes the registered Watcher to be triggered if the znode changes state. Although
we’re not using it here, watching a znode’s children would permit a program to get
notifications of members joining or leaving the group, or of the group being deleted.
We catch KeeperException.NoNodeException, which is thrown in the case when the
group’s znode does not exist.
Let’s see ListGroup in action. As expected, the zoo group is empty, since we haven’t
added any members yet:
% java ListGroup localhost zoo
No members in group zoo
We can use JoinGroup to add some members. We launch them as background processes,
since they don’t terminate on their own (due to the sleep statement):
% java JoinGroup localhost zoo duck &
% java JoinGroup localhost zoo cow &
% java JoinGroup localhost zoo goat &
% goat_pid=$!

The last line saves the process ID of the Java process running the program that adds
goat as a member. We need to remember the ID so that we can kill the process in a
moment, after checking the members:
% java ListGroup localhost zoo
goat
duck
cow
To remove a member, we kill its process:
% kill $goat_pid

And a few seconds later, it has disappeared from the group because the process’s ZooKeeper
session has terminated (the timeout was set to 5 seconds) and its associated
ephemeral node has been removed:
% java ListGroup localhost zoo
duck
cow
Let’s stand back and see what we’ve built here. We have a way of building up a list of
a group of nodes that are participating in a distributed system. The nodes may have no
knowledge of each other. A client that wants to use the nodes in the list to perform
some work, for example, can discover the nodes without them being aware of the client’s
existence.
Finally, note that group membership is not a substitute for handling network errors
when communicating with a node. Even if a node is a group member, communications
with it may fail, and such failures must be handled in the usual ways (retrying, trying
a different member of the group, and so on).
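
The "trying a different member" strategy mentioned above could be sketched as follows. This is a minimal illustration in plain Java, not part of the example application: MemberFailover and callAny are hypothetical names, and the call function stands for whatever request a client would actually send to a member.

```java
import java.util.List;
import java.util.Optional;
import java.util.function.Function;

// Hypothetical helper: tries each group member in turn until one call succeeds.
class MemberFailover {

  // 'call' is assumed to throw a RuntimeException when communication with a
  // member fails, and to return a non-null result when it succeeds.
  static <R> Optional<R> callAny(List<String> members, Function<String, R> call) {
    for (String member : members) {
      R result;
      try {
        result = call.apply(member);
      } catch (RuntimeException e) {
        continue; // listed in the group but unreachable: try the next member
      }
      return Optional.of(result);
    }
    return Optional.empty(); // every listed member failed
  }
}
```

The member names would come from a call such as getChildren() on the group znode. An empty result means the caller still has to decide what to do, for example by re-reading the membership list and retrying later.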

ZooKeeper command-line tools:

ZooKeeper comes with a command-line tool for interacting with the ZooKeeper namespace.
We can use it to list the znodes under the /zoo znode as follows:
% zkCli.sh localhost ls /zoo
Processing ls
WatchedEvent: Server state change. New state: SyncConnected
[duck, cow]

Deleting a group:

- To round off the example, let’s see how to delete a group. The ZooKeeper class provides
a delete() method that takes a path and a version number. ZooKeeper will delete a
znode only if the version number specified is the same as the version number of the
znode it is trying to delete, an optimistic locking mechanism that allows clients to detect
conflicts over znode modification. You can bypass the version check, however, by using
a version number of -1 to delete the znode regardless of its version number.
There is no recursive delete operation in ZooKeeper, so you have to delete child znodes
before parents. This is what we do in the DeleteGroup class, which will remove a group
and all its members.
Example: A program to delete a group and its members
import java.util.List;

import org.apache.zookeeper.KeeperException;

public class DeleteGroup extends ConnectionWatcher {

  public void delete(String groupName) throws KeeperException,
      InterruptedException {
    String path = "/" + groupName;

    try {
      // children first: ZooKeeper has no recursive delete
      List<String> children = zk.getChildren(path, false);
      for (String child : children) {
        zk.delete(path + "/" + child, -1); // -1 bypasses the version check
      }
      zk.delete(path, -1);
    } catch (KeeperException.NoNodeException e) {
      System.out.printf("Group %s does not exist\n", groupName);
      System.exit(1);
    }
  }

  public static void main(String[] args) throws Exception {
    DeleteGroup deleteGroup = new DeleteGroup();
    deleteGroup.connect(args[0]);
    deleteGroup.delete(args[1]);
    deleteGroup.close();
  }
}
Finally, we can delete the zoo group that we created earlier:
% java DeleteGroup localhost zoo
% java ListGroup localhost zoo
Group zoo does not exist

ZooKeeper Service:

ZooKeeper is a highly available, high-performance coordination service. In this section,
we look at the nature of the service it provides: its model, operations, and
implementation.

Data Model:

ZooKeeper maintains a hierarchical tree of nodes called znodes. A znode stores data
and has an associated ACL. ZooKeeper is designed for coordination (which typically
uses small data files), not high-volume data storage, so there is a limit of 1 MB on the
amount of data that may be stored in any znode.
Data access is atomic. A client reading the data stored at a znode will never receive only
some of the data; either the data will be delivered in its entirety, or the read will fail.
Similarly, a write will replace all the data associated with a znode. ZooKeeper guarantees
that the write will either succeed or fail; there is no such thing as a partial write, where only
some of the data written by the client is stored.
ZooKeeper does not support an append operation. These characteristics contrast with HDFS,
which is designed for high-volume data storage, with streaming data access, and provides an append
operation.

Znodes are referenced by paths, which in ZooKeeper are represented as slash-delimited
Unicode character strings, like filesystem paths in Unix. Paths must be absolute, so
they must begin with a slash character. Furthermore, they are canonical, which means
that each path has a single representation, and so paths do not undergo resolution. For
example, in Unix, a file with the path /a/b can equivalently be referred to by the
path /a/./b, since “.” refers to the current directory at the point it is encountered in the
path. In ZooKeeper, “.” does not have this special meaning and is actually illegal as a
path component (as is “..” for the parent of the current directory).
Path components are composed of Unicode characters, with a few restrictions (these
are spelled out in the ZooKeeper reference documentation). The string “zookeeper” is
a reserved word and may not be used as a path component. In particular, ZooKeeper
uses the /zookeeper subtree to store management information, such as information on
quotas.
Note that paths are not URIs, and they are represented in the Java API by a
java.lang.String, rather than the Hadoop Path class (or by the java.net.URI class, for
that matter).
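
The path rules above can be summarized in a small checker. This is a simplified sketch that encodes only the rules stated in the text, plus the rejection of empty components and trailing slashes. It is not the validation code that ZooKeeper itself uses, and ZnodePathRules is a hypothetical name.

```java
// Simplified check of the znode path rules described in the text.
class ZnodePathRules {

  static boolean isValid(String path) {
    if (path == null || !path.startsWith("/")) {
      return false; // paths must be absolute
    }
    if (path.equals("/")) {
      return true; // the root znode
    }
    if (path.endsWith("/")) {
      return false; // a trailing slash would leave an empty last component
    }
    for (String component : path.substring(1).split("/")) {
      if (component.isEmpty() || component.equals(".") || component.equals("..")) {
        return false; // paths are canonical: no empty, "." or ".." components
      }
      if (component.equals("zookeeper")) {
        return false; // reserved word, as stated in the text
      }
    }
    return true;
  }

  public static void main(String[] args) {
    for (String p : new String[] {"/a/b", "/a/./b", "a/b", "/zookeeper/quota"}) {
      System.out.println(p + " -> " + isValid(p));
    }
  }
}
```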
Znodes have some properties that are very useful for building distributed applications,
which we discuss in the following sections.

Ephemeral Znodes:

Znodes can be one of two types: ephemeral or persistent. A znode’s type is set at creation
time and may not be changed later. An ephemeral znode is deleted by ZooKeeper when
the creating client’s session ends. By contrast, a persistent znode is not tied to the client’s
session and is deleted only when explicitly deleted by a client (not necessarily the one
that created it). An ephemeral znode may not have children, not even ephemeral ones.
Even though ephemeral nodes are tied to a client session, they are visible to all clients
(subject to their ACL policy, of course).
Ephemeral znodes are ideal for building applications that need to know when certain
distributed resources are available. The example earlier in this chapter uses ephemeral
znodes to implement a group membership service, so any process can discover the
members of the group at any particular time.
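
To make the lifecycle rules concrete, here is a toy in-memory model of session-bound znodes. It is only a sketch of the behavior described above (removal when the session ends, no children under an ephemeral znode) and does not reflect how ZooKeeper is implemented. EphemeralModel and its methods are hypothetical names, and for brevity the model does not check that a parent exists.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Toy model of persistent and ephemeral znodes; not the ZooKeeper implementation.
class EphemeralModel {

  // Maps each znode path to the session that owns it, or null if persistent.
  private final Map<String, Long> ownerByPath = new HashMap<>();

  void createPersistent(String path) {
    checkParentCanHaveChildren(path);
    ownerByPath.put(path, null);
  }

  void createEphemeral(String path, long sessionId) {
    checkParentCanHaveChildren(path);
    ownerByPath.put(path, sessionId);
  }

  // Ending a session removes every ephemeral znode that the session created.
  void closeSession(long sessionId) {
    ownerByPath.values().removeIf(owner -> owner != null && owner == sessionId);
  }

  // All existing paths, visible to every client regardless of who created them.
  Set<String> paths() {
    return new TreeSet<>(ownerByPath.keySet());
  }

  private void checkParentCanHaveChildren(String path) {
    String parent = path.substring(0, path.lastIndexOf('/'));
    if (ownerByPath.get(parent) != null) { // a non-null owner means ephemeral
      throw new IllegalStateException("ephemeral znode cannot have children: " + parent);
    }
  }
}
```

In terms of the group membership example, closing the session that created /zoo/duck leaves /zoo and the other members in place, which is the behavior seen when the goat process was killed.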

Sequence Numbers:

A sequential znode is given a sequence number by ZooKeeper as a part of its name. If
a znode is created with the sequential flag set, then the value of a monotonically increasing
counter (maintained by the parent znode) is appended to its name.
If a client asks to create a sequential znode with the name /a/b-, for example, then the
znode created may actually have the name /a/b-3.
If, later on, another sequential znode
with the name /a/b- is created, then it will be given a unique name with a larger value
of the counter—for example, /a/b-5. In the Java API, the actual path given to sequential
znodes is communicated back to the client as the return value of the create() call.
Sequence numbers can be used to impose a global ordering on events in a distributed
system, and may be used by the client to infer the ordering. In “A Lock Service”
you will learn how to use sequential znodes to build a shared lock.
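
The naming scheme can be illustrated with a toy model. One detail beyond the text: real ZooKeeper pads the counter to ten digits (so a name looks like b-0000000003 rather than b-3), and the sketch below follows that format. SequentialNamer is a hypothetical name, and unlike the real service this model advances the counter only on sequential creates, so it never produces gaps.

```java
import java.util.HashMap;
import java.util.Map;

// Toy model of sequential znode naming; not the ZooKeeper implementation.
class SequentialNamer {

  // One counter per parent path, as the text describes.
  private final Map<String, Integer> nextByParent = new HashMap<>();

  // Returns the actual path for a sequential create request such as "/a/b-".
  String create(String requestedPath) {
    String parent = requestedPath.substring(0, requestedPath.lastIndexOf('/'));
    // merge() stores the incremented count; subtract 1 to get this znode's number.
    int sequence = nextByParent.merge(parent, 1, Integer::sum) - 1;
    return requestedPath + String.format("%010d", sequence);
  }

  public static void main(String[] args) {
    SequentialNamer namer = new SequentialNamer();
    System.out.println(namer.create("/a/b-"));
    System.out.println(namer.create("/a/b-"));
  }
}
```

Because the suffix has a fixed width, sorting the returned names as strings gives the same order as sorting by counter, which is what makes the names usable for ordering.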

Watches:

Watches allow clients to get notifications when a znode changes in some way. Watches
are set by operations on the ZooKeeper service, and are triggered by other operations
on the service. For example, a client might call the exists operation on a znode, placing
a watch on it at the same time. If the znode doesn’t exist, then the exists operation
will return false. If, some time later, the znode is created by a second client, then the
watch is triggered, notifying the first client of the znode’s creation. You will see precisely
which operations trigger others in the next section.
Watches are triggered only once.
To receive multiple notifications, a client needs to
reregister the watch. If the client in the previous example wishes to receive further
notifications for the znode’s existence (to be notified when it is deleted, for example),
it needs to call the exists operation again to set a new watch.
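
The one-shot behavior is easy to see in a toy model. The sketch below is plain Java with no ZooKeeper dependency. WatchModel is a hypothetical name, and a watcher here is just a callback that receives an event name, whereas the real API delivers a WatchedEvent to a Watcher.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.function.Consumer;

// Toy model of one-shot watches; not the ZooKeeper implementation.
class WatchModel {

  private final Set<String> znodes = new HashSet<>();
  private final Map<String, List<Consumer<String>>> watches = new HashMap<>();

  // Tests for existence and, if a watcher is given, registers it for one event.
  boolean exists(String path, Consumer<String> watcher) {
    if (watcher != null) {
      watches.computeIfAbsent(path, p -> new ArrayList<>()).add(watcher);
    }
    return znodes.contains(path);
  }

  void create(String path) {
    znodes.add(path);
    fire(path, "NodeCreated");
  }

  void delete(String path) {
    znodes.remove(path);
    fire(path, "NodeDeleted");
  }

  // Removing the watchers before calling them is what makes a watch one-shot.
  private void fire(String path, String eventType) {
    List<Consumer<String>> triggered = watches.remove(path);
    if (triggered != null) {
      triggered.forEach(w -> w.accept(eventType));
    }
  }
}
```

A client that registers a watch through exists(), sees the creation event, and does not call exists() again will miss the later deletion.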

Operations:

- Nine basic operations in ZooKeeper:

Table: Operations in the ZooKeeper service
Operation            Description
create               Creates a znode (the parent znode must already exist)
delete               Deletes a znode (the znode must not have any children)
exists               Tests whether a znode exists and retrieves its metadata
getACL, setACL       Gets/sets the ACL for a znode
getChildren          Gets a list of the children of a znode
getData, setData     Gets/sets the data associated with a znode
sync                 Synchronizes a client’s view of a znode with ZooKeeper

Update operations in ZooKeeper are conditional. A delete or setData operation has to
specify the version number of the znode that is being updated (which is found from a
previous exists call). If the version number does not match, the update will fail. Updates
are a nonblocking operation, so a client that loses an update (because another
process updated the znode in the meantime) can decide whether to try again or take
some other action, and it can do so without blocking the progress of any other process.
Although ZooKeeper can be viewed as a filesystem, there are some filesystem primitives
that it does away with in the name of simplicity. Because files are small and are written
and read in their entirety, there is no need to provide open, close, or seek operations.
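
The conditional update can be modeled in a few lines. This is a toy in-memory sketch, not the ZooKeeper implementation: VersionedZnode is a hypothetical name, and a failed update is reported with a boolean here, whereas the real Java API throws KeeperException.BadVersionException.

```java
// Toy model of a znode's data version and conditional updates.
class VersionedZnode {

  private byte[] data = new byte[0];
  private int version = 0;

  synchronized int getVersion() {
    return version;
  }

  synchronized byte[] getData() {
    return data.clone();
  }

  // Succeeds only if expectedVersion matches the current version, or is -1
  // (which bypasses the check). Returns false if another writer got there first.
  synchronized boolean setData(byte[] newData, int expectedVersion) {
    if (expectedVersion != -1 && expectedVersion != version) {
      return false;
    }
    data = newData.clone();
    version++; // every successful write bumps the version
    return true;
  }
}
```

A client that gets false back has lost the race. It can read the new version and retry, or give up, and in neither case does it hold a lock that would stall other clients.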

Multi-update:

There is another ZooKeeper operation, called multi, which batches together multiple
primitive operations into a single unit that either succeeds or fails in its entirety. The
situation where some of the primitive operations succeed and some fail can never arise.
Multi-update is very useful for building structures in ZooKeeper that maintain some
global invariant. One example is an undirected graph. Each vertex in the graph is naturally
represented as a znode in ZooKeeper, and to add or remove an edge we need to update the two znodes corresponding to its vertices, since each has a reference to the other. If we only used primitive ZooKeeper operations, it would be possible for another client to observe the graph in an inconsistent state where one vertex is connected to another but the reverse connection is absent. Batching the updates on the two znodes into one multi operation ensures that the update is atomic, so a pair of vertices can
never have a dangling connection.
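
The all-or-nothing behavior can be sketched with a toy in-memory store. In the real Java client the operation is ZooKeeper.multi(), which takes a list of Op objects. The model below only imitates its atomicity: AtomicBatch is a hypothetical name, and an operation is any function that edits a map of paths to data.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Consumer;

// Toy model of an atomic batch of updates; not the ZooKeeper implementation.
class AtomicBatch {

  private Map<String, String> znodes = new HashMap<>();

  // Applies every operation to a private copy and publishes the copy only if
  // none of them throws, so readers never observe a half-applied batch.
  synchronized boolean multi(List<Consumer<Map<String, String>>> ops) {
    Map<String, String> working = new HashMap<>(znodes);
    try {
      for (Consumer<Map<String, String>> op : ops) {
        op.accept(working);
      }
    } catch (RuntimeException e) {
      return false; // discard the copy: nothing was changed
    }
    znodes = working;
    return true;
  }

  synchronized String get(String path) {
    return znodes.get(path);
  }
}
```

For the graph example, adding an edge would be a batch of two operations, one per vertex znode. If the second fails, the first is not visible either.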

ACLs:

A znode is created with a list of ACLs, which determines who can perform certain
operations on it.
ACLs depend on authentication, the process by which the client identifies itself to
ZooKeeper. There are a few authentication schemes that ZooKeeper provides:
- digest: The client is authenticated by a username and password.
- sasl: The client is authenticated using Kerberos.
- ip: The client is authenticated by its IP address.
Clients may authenticate themselves after establishing a ZooKeeper session. Authentication
is optional, although a znode’s ACL may require an authenticated client, in
which case the client must authenticate itself to access the znode. Here is an example
of using the digest scheme to authenticate with a username and password:
zk.addAuthInfo("digest", "tom:secret".getBytes());
An ACL is the combination of an authentication scheme, an identity for that scheme,
and a set of permissions. For example, if we wanted to give a client with the IP address
10.0.0.1 read access to a znode, we would set an ACL on the znode with the ip scheme,
an ID of 10.0.0.1, and READ permission. In Java, we would create the ACL object as
follows:
new ACL(Perms.READ,
new Id("ip", "10.0.0.1"));
The full set of permissions is listed in the table below. Note that the exists operation is
not governed by an ACL permission, so any client may call exists to find the Stat for
a znode or to discover that a znode does not in fact exist.

Table: ACL permissions
ACL permission     Permitted operations
CREATE             create (a child znode)
READ               getChildren, getData
WRITE              setData
DELETE             delete (a child znode)
ADMIN              setACL
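
The table above amounts to a bitmask check. The sketch below uses the same bit values as ZooKeeper's ZooDefs.Perms constants, but the class AclPerms and its helper methods are hypothetical and cover only the operations listed in the table, plus exists.

```java
// Permission bits as in ZooDefs.Perms; the lookup helpers are illustrative.
class AclPerms {
  static final int READ = 1 << 0;
  static final int WRITE = 1 << 1;
  static final int CREATE = 1 << 2;
  static final int DELETE = 1 << 3;
  static final int ADMIN = 1 << 4;

  // Permission needed for each operation, following the table above.
  // For create and delete, the permission is the one held on the parent znode.
  static int required(String operation) {
    switch (operation) {
      case "create": return CREATE;
      case "getChildren":
      case "getData": return READ;
      case "setData": return WRITE;
      case "delete": return DELETE;
      case "setACL": return ADMIN;
      case "exists": return 0; // not governed by any permission
      default: throw new IllegalArgumentException("not in the table: " + operation);
    }
  }

  // True if the granted permission bits include the one the operation needs.
  static boolean permits(int granted, String operation) {
    int needed = required(operation);
    return (granted & needed) == needed;
  }
}
```

For instance, the READ-only ACL for 10.0.0.1 shown earlier would permit getData and getChildren but not setData.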

There are a number of predefined ACLs defined in the ZooDefs.Ids class, including
OPEN_ACL_UNSAFE, which gives all permissions (including ADMIN, since it is built from Perms.ALL) to everyone.
In addition, ZooKeeper has a pluggable authentication mechanism, which makes it
possible to integrate third-party authentication systems if needed.

Implementation:

The ZooKeeper service can run in two modes. In standalone mode, there is a single
ZooKeeper server, which is useful for testing due to its simplicity (it can even be
embedded in unit tests), but provides no guarantees of high-availability or resilience.
In production, ZooKeeper runs in replicated mode, on a cluster of machines called an
ensemble. ZooKeeper achieves high-availability through replication, and can provide a
service as long as a majority of the machines in the ensemble are up. For example, in a
five-node ensemble, any two machines can fail and the service will still work because
a majority of three remain. Note that a six-node ensemble can also tolerate only two
machines failing, since with three failures the remaining three do not constitute a majority
of the six. For this reason, it is usual to have an odd number of machines in an ensemble.
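
The majority arithmetic behind this rule can be checked with a short sketch. QuorumMath is a hypothetical helper class written for illustration, not part of ZooKeeper.

```java
// Majority arithmetic for an ensemble of a given size.
class QuorumMath {

  // Smallest number of machines that forms a strict majority.
  static int majority(int ensembleSize) {
    return ensembleSize / 2 + 1;
  }

  // Number of machines that can fail while a majority is still up.
  static int toleratedFailures(int ensembleSize) {
    return ensembleSize - majority(ensembleSize);
  }

  public static void main(String[] args) {
    for (int n = 3; n <= 7; n++) {
      System.out.println(n + " machines: majority " + majority(n)
          + ", tolerates " + toleratedFailures(n) + " failures");
    }
  }
}
```

Going from five machines to six raises the majority from three to four without raising the number of tolerated failures, which is why the even size buys nothing.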
Conceptually, ZooKeeper is very simple: all it has to do is ensure that every modification
to the tree of znodes is replicated to a majority of the ensemble. If a minority of the
machines fail, then a minimum of one machine will survive with the latest state. The
other remaining replicas will eventually catch up with this state.
The implementation of this simple idea, however, is nontrivial. ZooKeeper uses a protocol
called Zab that runs in two phases, which may be repeated indefinitely:

Phase 1: Leader election
The machines in an ensemble go through a process of electing a distinguished
member, called the leader. The other machines are termed followers. This phase is
finished once a majority (or quorum) of followers have synchronized their state
with the leader.
Phase 2: Atomic broadcast
All write requests are forwarded to the leader, which broadcasts the update to the
followers. When a majority have persisted the change, the leader commits the update,
and the client gets a response saying the update succeeded. The protocol for achieving consensus is designed to be atomic, so a change either succeeds or fails. It resembles a two-phase commit.

Consistency:

Understanding the basis of ZooKeeper’s implementation helps in understanding the
consistency guarantees that the service makes. The terms “leader” and “follower” for
the machines in an ensemble are apt, for they make the point that a follower may lag
the leader by a number of updates. This is a consequence of the fact that only a majority
and not all of the ensemble needs to have persisted a change before it is committed. A
good mental model for ZooKeeper is of clients connected to ZooKeeper servers that
are following the leader. A client may actually be connected to the leader, but it has no
control over this, and cannot even know if this is the case.
Every update made to the znode tree is given a globally unique identifier, called a
zxid (which stands for “ZooKeeper transaction ID”). Updates are ordered, so if zxid
z1 is less than z2, then z1 happened before z2, according to ZooKeeper, which is the
single authority on ordering in the distributed system.
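
As a small illustration of zxid ordering: according to the ZooKeeper internals documentation, a zxid is a 64-bit number whose high 32 bits hold the leader epoch and whose low 32 bits hold a counter within that epoch. That layout is not stated in the text above, so treat the sketch below as an assumption-laden illustration. Zxid here is a hypothetical helper class, not a ZooKeeper type.

```java
// Helper for composing and comparing zxids, assuming the epoch/counter layout.
class Zxid {

  // Packs an epoch (high 32 bits) and a counter (low 32 bits) into one long.
  static long of(int epoch, int counter) {
    return ((long) epoch << 32) | (counter & 0xFFFFFFFFL);
  }

  static int epoch(long zxid) {
    return (int) (zxid >>> 32);
  }

  static int counter(long zxid) {
    return (int) zxid;
  }

  // Updates are totally ordered: the smaller zxid happened first.
  static boolean happenedBefore(long z1, long z2) {
    return z1 < z2;
  }
}
```

With this layout, any update made under a later leader compares greater than every update from an earlier epoch, regardless of the counter values.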

The following guarantees for data consistency flow from ZooKeeper’s design:

Sequential consistency

Updates from any particular client are applied in the order that they are sent. This

means that if a client updates the znode z to the value a, and in a later operation,

it updates z to the value b, then no client will ever see z with value a after it has

seen it with value b (if no other updates are made to z).

Atomicity

Updates either succeed or fail. This means that if an update fails, no client will ever

see it.

Single system image

A client will see the same view of the system regardless of the server it connects to.

This means that if a client connects to a new server during the same session, it will

not see an older state of the system than the one it saw with the previous server.

When a server fails and a client tries to connect to another in the ensemble, a server

that is behind the one that failed will not accept connections from the client until

it has caught up with the failed server.

Durability

Once an update has succeeded, it will persist and will not be undone. This means

updates will survive server failures.

Timeliness

The lag in any client’s view of the system is bounded, so it will not be out of date

by more than some multiple of tens of seconds. This means that rather than allow

a client to see data that is very stale, a server will shut down, forcing the client to

switch to a more up-to-date server.

For performance reasons, reads are satisfied from a ZooKeeper server’s memory and

do not participate in the global ordering of writes. This property can lead to the appearance

of inconsistent ZooKeeper states from clients that communicate through a

mechanism outside ZooKeeper.

For example, client A updates znode z from a to a’, A tells B to read z, B reads the value

of z as a, not a’. This is perfectly compatible with the guarantees that ZooKeeper makes

(this condition that it does not promise is called “Simultaneously Consistent Cross-Client

Views”). To prevent this condition from happening, B should call sync on z,

before reading z’s value. The sync operation forces the ZooKeeper server to which B is

connected to “catch up” with the leader, so that when B reads z’s value it will be the

one that A set (or a later value).

Sessions:

A ZooKeeper client is configured with the list of servers in the ensemble. On startup,

it tries to connect to one of the servers in the list. If the connection fails, it tries another

server in the list, and so on, until it either successfully connects to one of them or fails

if all ZooKeeper servers are unavailable.

Once a connection has been made with a ZooKeeper server, the server creates a new

session for the client. A session has a timeout period that is decided on by the application

that creates it. If the server hasn’t received a request within the timeout period, it may expire the session. Once a session has expired, it may not be reopened, and any ephemeral nodes associated with the session will be lost. Although session expiry is a comparatively rare event, since sessions are long-lived, it is important for applications to handle it.

Sessions are kept alive by the client sending ping requests (also known as heartbeats)

whenever the session is idle for longer than a certain period. (Pings are automatically

sent by the ZooKeeper client library, so your code doesn’t need to worry about maintaining

the session.) The period is chosen to be low enough to detect server failure

(manifested by a read timeout) and reconnect to another server within the session

timeout period.

Failover to another ZooKeeper server is handled automatically by the ZooKeeper client,
and, crucially, sessions (and associated ephemeral znodes) are still valid after another
server takes over from the failed one.
During failover, the application will receive notifications of disconnections and connections
to the service. Watch notifications will not be delivered while the client is disconnected, but they will be delivered when the client successfully reconnects. Also, if the application tries to perform an operation while the client is reconnecting to another server, the operation will fail. This underlines the importance of handling connection loss exceptions in real-world ZooKeeper applications.

Time:

There are several time parameters in ZooKeeper. The tick time is the fundamental period
of time in ZooKeeper and is used by servers in the ensemble to define the schedule on
which their interactions run. Other settings are defined in terms of tick time, or are at
least constrained by it. The session timeout, for example, may not be less than 2 ticks
or more than 20. If you attempt to set a session timeout outside this range, it will be
modified to fall within the range.
A common tick time setting is 2 seconds (2,000 milliseconds). This translates to an
allowable session timeout of between 4 and 40 seconds. There are a few considerations
in selecting a session timeout.
A low session timeout leads to faster detection of machine failure. In the group membership
example, the session timeout is the time it takes for a failed machine to be
removed from the group. Beware of setting the session timeout too low, however, since
a busy network can cause packets to be delayed and may cause inadvertent session
expiry. In such an event, a machine would appear to “flap”: leaving and then rejoining
the group repeatedly in a short space of time.
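
The clamping rule described above (a session timeout of no less than 2 ticks and no more than 20) can be sketched in a few lines. This is only an illustration of the arithmetic, not ZooKeeper's own code; the class and method names are made up for this example.

```java
public class SessionTimeoutClamp {

    /** Clamps a requested session timeout into the range [2 * tickTime, 20 * tickTime]. */
    public static int clamp(int requestedMs, int tickTimeMs) {
        int min = 2 * tickTimeMs;
        int max = 20 * tickTimeMs;
        return Math.max(min, Math.min(max, requestedMs));
    }

    public static void main(String[] args) {
        int tickTimeMs = 2000; // the common 2-second tick time
        System.out.println(clamp(1000, tickTimeMs));   // too low: raised to 4000
        System.out.println(clamp(15000, tickTimeMs));  // within range: stays 15000
        System.out.println(clamp(60000, tickTimeMs));  // too high: lowered to 40000
    }
}
```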

Applications that create more complex ephemeral state should favor longer session
timeouts, as the cost of reconstruction is higher. In some cases, it is possible to design
the application so it can restart within the session timeout period and avoid session
expiry. (This might be desirable to perform maintenance or upgrades.) Every session
is given a unique identity and password by the server, and if these are passed to ZooKeeper
while a connection is being made, it is possible to recover a session (as long as it hasn’t expired). An application can therefore arrange a graceful shutdown, whereby it stores the session identity and password to stable storage before restarting the process, retrieving the stored session identity and password, and recovering the session. You should view this feature as an optimization, which can help avoid expired sessions. It does not remove the need to handle session expiry, which can still occur if a machine
fails unexpectedly, or even if an application is shut down gracefully but does not restart
before its session expires—for whatever reason.

As a general rule, the larger the ZooKeeper ensemble, the larger the session timeout
should be. Connection timeouts, read timeouts, and ping periods are all defined internally
as a function of the number of servers in the ensemble, so as the ensemble grows, these periods decrease. Consider increasing the timeout if you experience frequent connection loss. You can monitor ZooKeeper metrics—such as request latency statistics—using JMX.

States:

The ZooKeeper object transitions through different states in its lifecycle.
You can query its state at any time by using the getState() method:
public States getState()
States is an enum representing the different states that a ZooKeeper object may be in.
(Despite the enum’s name, an instance of ZooKeeper may only be in one state at a time.)
A newly constructed ZooKeeper instance is in the CONNECTING state, while it tries to
establish a connection with the ZooKeeper service. Once a connection is established,
it goes into the CONNECTED state.

A client using the ZooKeeper object can receive notifications of the state transitions by

registering a Watcher object. On entering the CONNECTED state, the watcher receives a

WatchedEvent whose KeeperState value is SyncConnected.

The ZooKeeper instance may disconnect and reconnect to the ZooKeeper service, moving

between the CONNECTED and CONNECTING states. If it disconnects, the watcher receives a

Disconnected event. Note that these state transitions are initiated by the ZooKeeper

instance itself, and it will automatically try to reconnect if the connection is lost.

The ZooKeeper instance may transition to a third state, CLOSED, if either the close()

method is called or the session times out as indicated by a KeeperState of type

Expired. Once in the CLOSED state, the ZooKeeper object is no longer considered to be

alive (this can be tested using the isAlive() method on States) and cannot be reused.

To reconnect to the ZooKeeper service, the client must construct a new ZooKeeper

instance.

Building Applications using ZooKeeper:

A configuration service:

One of the most basic services that a distributed application needs is a configuration

service so that common pieces of configuration information can be shared by machines

in a cluster. At the simplest level, ZooKeeper can act as a highly available store for

configuration, allowing application participants to retrieve or update configuration

files. Using ZooKeeper watches, it is possible to create an active configuration service,

where interested clients are notified of changes in configuration.

Let’s write such a service. We make a couple of assumptions that simplify the implementation

(they could be removed with a little more work). First, the only configuration values we need to store are strings, and keys are just znode paths, so we use a znode to store each key-value pair. Second, there is a single client that performs updates at any one time.

Among other things, this model fits with the idea of a master (such as the namenode in HDFS) that wishes to update information that its workers need to follow.

We wrap the code up in a class called ActiveKeyValueStore:
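import java.nio.charset.Charset;

import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.ZooDefs.Ids;
import org.apache.zookeeper.data.Stat;

public class ActiveKeyValueStore extends ConnectionWatcher {

  private static final Charset CHARSET = Charset.forName("UTF-8");

  public void write(String path, String value) throws InterruptedException,
      KeeperException {
    // Create the znode if it is absent; otherwise overwrite it (-1 matches any version)
    Stat stat = zk.exists(path, false);
    if (stat == null) {
      zk.create(path, value.getBytes(CHARSET), Ids.OPEN_ACL_UNSAFE,
          CreateMode.PERSISTENT);
    } else {
      zk.setData(path, value.getBytes(CHARSET), -1);
    }
  }
}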

The contract of the write() method is that a key with the given value is written to
ZooKeeper. It hides the difference between creating a new znode and updating an existing
znode with a new value, by testing first for the znode using the exists operation and then performing the
appropriate operation. The other detail worth mentioning is the need to convert the string value to a byte array, for which we just use the getBytes() method with a UTF-8 encoding.
To illustrate the use of the ActiveKeyValueStore, consider a ConfigUpdater class that
updates a configuration property with a value. The listing appears in Example 14-6.

Example: An application that updates a property in ZooKeeper at random times
import java.io.IOException;
import java.util.Random;
import java.util.concurrent.TimeUnit;

import org.apache.zookeeper.KeeperException;

public class ConfigUpdater {

  public static final String PATH = "/config";

  private ActiveKeyValueStore store;
  private Random random = new Random();

  public ConfigUpdater(String hosts) throws IOException, InterruptedException {
    store = new ActiveKeyValueStore();
    store.connect(hosts);
  }

  public void run() throws InterruptedException, KeeperException {
    while (true) {
      // Write a random value, then pause for a random interval of up to 9 seconds
      String value = random.nextInt(100) + "";
      store.write(PATH, value);
      System.out.printf("Set %s to %s\n", PATH, value);
      TimeUnit.SECONDS.sleep(random.nextInt(10));
    }
  }

  public static void main(String[] args) throws Exception {
    ConfigUpdater configUpdater = new ConfigUpdater(args[0]);
    configUpdater.run();
  }
}

The program is simple. A ConfigUpdater has an ActiveKeyValueStore that connects to
ZooKeeper in ConfigUpdater’s constructor. The run() method loops forever, updating
the /config znode at random times with random values.
Next, let’s look at how to read the /config configuration property. First, we add a read
method to ActiveKeyValueStore:
  public String read(String path, Watcher watcher) throws InterruptedException,
      KeeperException {
    byte[] data = zk.getData(path, watcher, null/*stat*/);
    return new String(data, CHARSET);
  }
The getData() method of ZooKeeper takes the path, a Watcher, and a Stat object. The
Stat object is filled in with values by getData(), and is used to pass information back
to the caller. In this way, the caller can get both the data and the metadata for a znode,
although in this case, we pass a null Stat because we are not interested in the metadata.
As a consumer of the service, ConfigWatcher (see Example 14-7) creates an
ActiveKeyValueStore, and after starting, calls the store’s read() method (in its displayConfig()
method) to pass a reference to itself as the watcher. It displays the initial value of the
configuration that it reads.

Example: An application that watches for updates of a property in ZooKeeper and prints them
to the console
import java.io.IOException;

import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.WatchedEvent;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.Watcher.Event.EventType;

public class ConfigWatcher implements Watcher {

  private ActiveKeyValueStore store;

  public ConfigWatcher(String hosts) throws IOException, InterruptedException {
    store = new ActiveKeyValueStore();
    store.connect(hosts);
  }

  public void displayConfig() throws InterruptedException, KeeperException {
    // Passing this as the watcher re-registers the watch on every read
    String value = store.read(ConfigUpdater.PATH, this);
    System.out.printf("Read %s as %s\n", ConfigUpdater.PATH, value);
  }

  @Override
  public void process(WatchedEvent event) {
    if (event.getType() == EventType.NodeDataChanged) {
      try {
        displayConfig();
      } catch (InterruptedException e) {
        System.err.println("Interrupted. Exiting.");
        Thread.currentThread().interrupt();
      } catch (KeeperException e) {
        System.err.printf("KeeperException: %s. Exiting.\n", e);
      }
    }
  }

  public static void main(String[] args) throws Exception {
    ConfigWatcher configWatcher = new ConfigWatcher(args[0]);
    configWatcher.displayConfig();

    // stay alive until process is killed or thread is interrupted
    Thread.sleep(Long.MAX_VALUE);
  }
}

When the ConfigUpdater updates the znode, ZooKeeper causes the watcher to fire with
an event type of EventType.NodeDataChanged. ConfigWatcher acts on this event in its
process() method by reading and displaying the latest version of the config.
Because watches are one-time signals, we tell ZooKeeper of the new watch each time
we call read() on ActiveKeyValueStore—this ensures we see future updates. Furthermore,
we are not guaranteed to receive every update, since between the receipt of the watch event and the next read, the znode may have been updated, possibly many times, and as the client has no watch registered during that period, it is not notified. For the configuration service, this is not a problem because clients care only about the latest value of a property, as it takes precedence over previous values, but in general you
should be aware of this potential limitation.
Let’s see the code in action. Launch the ConfigUpdater in one terminal window:
% java ConfigUpdater localhost
Set /config to 79
Set /config to 14
Set /config to 78
Then launch the ConfigWatcher in another window immediately afterward:
% java ConfigWatcher localhost
Read /config as 79
Read /config as 14
Read /config as 78
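
The point about one-time watches and missed updates can be reproduced without a ZooKeeper server. The sketch below uses a hypothetical in-memory FakeZnode class (not part of the ZooKeeper API) that fires and discards its watchers on each write, as a real znode watch does. It assumes a client that is slow to re-read after its watch has fired.

```java
import java.util.ArrayList;
import java.util.List;

public class OneShotWatchDemo {

    /** Hypothetical in-memory stand-in for a znode; this is not the ZooKeeper API. */
    public static class FakeZnode {
        private String data;
        private final List<Runnable> watchers = new ArrayList<>();

        public FakeZnode(String data) {
            this.data = data;
        }

        /** Returns the current data and, if watcher is non-null, registers it for one notification. */
        public String read(Runnable watcher) {
            if (watcher != null) {
                watchers.add(watcher);
            }
            return data;
        }

        /** Stores a new value, then fires and discards every registered watcher. */
        public void write(String value) {
            data = value;
            List<Runnable> toFire = new ArrayList<>(watchers);
            watchers.clear();
            for (Runnable w : toFire) {
                w.run();
            }
        }
    }

    /** Performs three writes while the client is slow to re-read; returns {notifications, latest value}. */
    public static String[] run() {
        FakeZnode znode = new FakeZnode("0");
        int[] notifications = {0};
        znode.read(() -> notifications[0]++); // the initial read sets the only watch
        znode.write("1"); // fires the watch, which is then gone
        znode.write("2"); // no watch is registered, so the client is not notified
        znode.write("3"); // likewise
        String latest = znode.read(() -> notifications[0]++); // re-read and re-register
        return new String[] {String.valueOf(notifications[0]), latest};
    }

    public static void main(String[] args) {
        String[] result = run();
        System.out.println("notifications=" + result[0] + ", latest=" + result[1]);
    }
}
```

Three writes produce a single notification, yet the next read still returns the latest value, which is all that a configuration client needs.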

Resilient ZooKeeper Application

The first of the Fallacies of Distributed Computing states that “The network is reliable.”
As they stand, the programs so far have been assuming a reliable network, so when they run on a real network, they can fail in several ways. Let’s examine possible failure modes and what we can do to correct them so that our programs are resilient in the face of failure.

Every ZooKeeper operation in the Java API declares two types of exception in its throws
clause: InterruptedException and KeeperException.
InterruptedException
An InterruptedException is thrown if the operation is interrupted. There is a standard
Java mechanism for canceling blocking methods, which is to call interrupt() on the
thread from which the blocking method was called. A successful cancellation will result
in an InterruptedException. ZooKeeper adheres to this standard, so you can cancel a
ZooKeeper operation in this way. Classes or libraries that use ZooKeeper should usually
propagate the InterruptedException so that their clients can cancel their operations.
An InterruptedException does not indicate a failure, but rather that the operation has
been canceled, so in the configuration application example, it is appropriate to propagate
the exception, causing the application to terminate.

KeeperException
A KeeperException is thrown if the ZooKeeper server signals an error or if there is a
communication problem with the server. There are various subclasses of
KeeperException for different error cases. For example, KeeperException.NoNodeException
is a subclass of KeeperException that is thrown if you try to perform an operation
on a znode that doesn’t exist.
Every subclass of KeeperException has a corresponding code with information about
the type of error. For example, for KeeperException.NoNodeException the code is
KeeperException.Code.NONODE (an enum value).
There are two ways then to handle KeeperException: either catch KeeperException and
test its code to determine what remedying action to take, or catch the equivalent
KeeperException subclasses and perform the appropriate action in each catch block.
KeeperExceptions fall into three broad categories.

State exceptions.

A state exception occurs when the operation fails because it cannot be

applied to the znode tree. State exceptions usually happen because another process is

mutating a znode at the same time. For example, a setData operation with a version

number will fail with a KeeperException.BadVersionException if the znode is updated

by another process first, since the version number does not match. The programmer is

usually aware that this kind of conflict is possible and will code to deal with it.

Some state exceptions indicate an error in the program, such as

KeeperException.NoChildrenForEphemeralsException, which is thrown when trying to create a child

znode of an ephemeral znode.

Recoverable exceptions.

Recoverable exceptions are those from which the application can

recover within the same ZooKeeper session. A recoverable exception is manifested by

KeeperException.ConnectionLossException, which means that the connection to

ZooKeeper has been lost. ZooKeeper will try to reconnect, and in most cases the reconnection

will succeed and ensure that the session is intact.

However, ZooKeeper cannot tell whether the operation that failed with

KeeperException.ConnectionLossException was applied. This is an example of partial failure (which

we introduced at the beginning of the chapter). The onus is therefore on the programmer

to deal with the uncertainty, and the action that should be taken depends on the application.

At this point, it is useful to make a distinction between idempotent and nonidempotent operations.

An idempotent operation is one that may be applied one or more times with the same result, such as a read request or an unconditional setData. These can simply be retried.

A nonidempotent operation cannot be indiscriminately retried, as the effect of applying

it multiple times is not the same as applying it once. The program needs a way of

detecting whether its update was applied by encoding information in the znode’s path

name or its data. We shall discuss how to deal with failed nonidempotent operations

in the “Recoverable exceptions” section under Lock Service, when we look at the implementation of a

lock service.

Unrecoverable exceptions.

In some cases, the ZooKeeper session becomes invalid—

perhaps because of a timeout or because the session was closed (both get a

KeeperException.SessionExpiredException), or perhaps because authentication failed

(KeeperException.AuthFailedException). In any case, all ephemeral nodes associated with the

session will be lost, so the application needs to rebuild its state before reconnecting to

ZooKeeper.

A reliable configuration service

Going back to the write() method in ActiveKeyValueStore, recall that it is composed

of an exists operation followed by either a create or a setData:

  public void write(String path, String value) throws InterruptedException,
      KeeperException {
    Stat stat = zk.exists(path, false);
    if (stat == null) {
      zk.create(path, value.getBytes(CHARSET), Ids.OPEN_ACL_UNSAFE,
          CreateMode.PERSISTENT);
    } else {
      zk.setData(path, value.getBytes(CHARSET), -1);
    }
  }

Taken as a whole, the write() method is idempotent, so we can afford to unconditionally

retry it. Here’s a modified version of the write() method that retries in a loop.

It is set to try a maximum number of retries (MAX_RETRIES) and sleeps for

RETRY_PERIOD_SECONDS between each attempt:

  public void write(String path, String value) throws InterruptedException,
      KeeperException {
    int retries = 0;
    while (true) {
      try {
        Stat stat = zk.exists(path, false);
        if (stat == null) {
          zk.create(path, value.getBytes(CHARSET), Ids.OPEN_ACL_UNSAFE,
              CreateMode.PERSISTENT);
        } else {
          zk.setData(path, value.getBytes(CHARSET), stat.getVersion());
        }
        return; // success: leave the retry loop
      } catch (KeeperException.SessionExpiredException e) {
        throw e; // the session cannot be recovered, so do not retry
      } catch (KeeperException e) {
        if (retries++ == MAX_RETRIES) {
          throw e;
        }
        // sleep then retry
        TimeUnit.SECONDS.sleep(RETRY_PERIOD_SECONDS);
      }
    }
  }

The code is careful not to retry KeeperException.SessionExpiredException, since when

a session expires, the ZooKeeper object enters the CLOSED state, from which it can never

reconnect (refer to Figure 14-3). We simply rethrow the exception

and let the caller create a new ZooKeeper instance, so that the whole write() method can be retried. A

simple way to create a new instance is to create a new ConfigUpdater (which we’ve

actually renamed ResilientConfigUpdater) to recover from an expired session:

  public static void main(String[] args) throws Exception {
    while (true) {
      try {
        ResilientConfigUpdater configUpdater =
            new ResilientConfigUpdater(args[0]);
        configUpdater.run();
      } catch (KeeperException.SessionExpiredException e) {
        // start a new session
      } catch (KeeperException e) {
        // already retried, so exit
        e.printStackTrace();
        break;
      }
    }
  }

An alternative way of dealing with session expiry would be to look for a KeeperState

of type Expired in the watcher (that would be the ConnectionWatcher in the example

here), and create a new connection when this is detected. This way, we would just keep

retrying in the write() method, even if we got a KeeperException.SessionExpiredException,

since the connection should eventually be reestablished. Regardless of the precise

mechanics of how we recover from an expired session, the important point is that it is

a different kind of failure from connection loss and needs to be handled differently.

This is just one strategy for retry handling—there are many others, such as using exponential

backoff where the period between retries is multiplied by a constant each

time. The org.apache.hadoop.io.retry package in Hadoop Core is a set of utilities for

adding retry logic into your code in a reusable way, and it may be helpful for building

ZooKeeper applications.
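
As a rough illustration of the exponential backoff strategy mentioned above, here is a minimal, library-free sketch. The BackoffRetry class, its parameters, and the simulated failure are all assumptions made for this example. A real ZooKeeper version would rethrow KeeperException.SessionExpiredException immediately rather than retry it.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.TimeUnit;

public class BackoffRetry {

    /**
     * Calls op until it succeeds or maxRetries retries have been used,
     * multiplying the sleep between attempts by multiplier each time.
     */
    public static <T> T retry(Callable<T> op, int maxRetries, long initialDelayMs,
            double multiplier) throws Exception {
        long delayMs = initialDelayMs;
        int retries = 0;
        while (true) {
            try {
                return op.call();
            } catch (InterruptedException e) {
                throw e; // cancellation, not failure: propagate without retrying
            } catch (Exception e) {
                if (retries++ == maxRetries) {
                    throw e;
                }
                TimeUnit.MILLISECONDS.sleep(delayMs);
                delayMs = (long) (delayMs * multiplier);
            }
        }
    }

    /** Runs an operation that fails twice with a simulated error, then succeeds. */
    public static String demo() {
        int[] attempts = {0};
        try {
            String result = retry(() -> {
                if (++attempts[0] < 3) {
                    throw new IllegalStateException("simulated connection loss");
                }
                return "ok";
            }, 5, 10, 2.0);
            return result + " after " + attempts[0] + " attempts";
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(demo()); // prints "ok after 3 attempts"
    }
}
```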

Lock Service:

A distributed lock is a mechanism for providing mutual exclusion between a collection

of processes. At any one time, only a single process may hold the lock. Distributed locks

can be used for leader election in a large distributed system, where the leader is the

process that holds the lock at any point in time.

To implement a distributed lock using ZooKeeper, we use sequential znodes to impose

an order on the processes vying for the lock. The idea is simple: first designate a lock

znode, typically describing the entity being locked on, say /leader; then clients that want

to acquire the lock create sequential ephemeral znodes as children of the lock znode.

At any point in time, the client with the lowest sequence number holds the lock. For

example, if two clients create znodes at around the same time, /leader/lock-1

and /leader/lock-2, then the client that created /leader/lock-1 holds the lock, since its

znode has the lowest sequence number. The ZooKeeper service is the arbiter of order,

since it assigns the sequence numbers.

The lock may be released simply by deleting the znode /leader/lock-1; alternatively, if

the client process dies, it will be deleted by virtue of it being an ephemeral znode. The

client that created /leader/lock-2 will then hold the lock, since it has the next lowest

sequence number. It will be notified that it has the lock by creating a watch that fires

when znodes go away.

The pseudocode for lock acquisition is as follows:

1. Create an ephemeral sequential znode named lock- under the lock znode and

remember its actual path name (the return value of the create operation).

2. Get the children of the lock znode and set a watch.

3. If the path name of the znode created in 1 has the lowest number of the children

returned in 2, then the lock has been acquired. Exit.

4. Wait for the notification from the watch set in 2 and go to step 2.
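
Step 3 of the pseudocode above comes down to comparing sequence numbers. The sketch below shows one way to do that comparison on a plain list of child names, which arrive in no particular order. It assumes the sequence number is the run of digits after the last hyphen; the class and method names are hypothetical.

```java
import java.util.Arrays;
import java.util.List;

public class LowestSequenceCheck {

    /** Parses the digits after the last '-' in a child name such as "lock-0000000002". */
    public static long sequenceOf(String childName) {
        return Long.parseLong(childName.substring(childName.lastIndexOf('-') + 1));
    }

    /** Step 3: true if myChild has the lowest sequence number among the children. */
    public static boolean holdsLock(List<String> children, String myChild) {
        long mine = sequenceOf(myChild);
        for (String child : children) {
            if (sequenceOf(child) < mine) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // Child names are deliberately unsorted here
        List<String> children = Arrays.asList("lock-2", "lock-1", "lock-3");
        System.out.println(holdsLock(children, "lock-1")); // true
        System.out.println(holdsLock(children, "lock-2")); // false
    }
}
```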

The herd effect

Although this algorithm is correct, there are some problems with it. The first problem

is that this implementation suffers from the herd effect. Consider hundreds or thousands

of clients, all trying to acquire the lock. Each client places a watch on the lock znode

for changes in its set of children. Every time the lock is released, or another

process starts the lock acquisition process, the watch fires and every client receives a

notification. The “herd effect” refers to a large number of clients being notified of the

same event, when only a small number of them can actually proceed. In this case, only

one client will successfully acquire the lock, and the process of maintaining and sending

watch events to all clients causes traffic spikes, which put pressure on the ZooKeeper

servers.

To avoid the herd effect, the condition for notification needs to be refined. The key

observation for implementing locks is that a client needs to be notified only when the

child znode with the previous sequence number goes away, not when any child znode

is deleted (or created). In our example, if clients have created the znodes /leader/

lock-1, /leader/lock-2, and /leader/lock-3, then the client holding /leader/lock-3 only

needs to be notified when /leader/lock-2 disappears. It does not need to be notified

when /leader/lock-1 disappears or when a new znode /leader/lock-4 is added.
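
To make the refined notification condition concrete, the following sketch picks the single znode a client would need to watch: the child with the highest sequence number below its own. It works on plain strings rather than a live ZooKeeper connection, and the PredecessorFinder name is an assumption for this example.

```java
import java.util.Arrays;
import java.util.List;

public class PredecessorFinder {

    /** Parses the digits after the last '-' in a child name such as "lock-3". */
    public static long sequenceOf(String childName) {
        return Long.parseLong(childName.substring(childName.lastIndexOf('-') + 1));
    }

    /**
     * Returns the child with the highest sequence number below myChild's,
     * or null if myChild is the lowest (that is, it holds the lock).
     */
    public static String predecessor(List<String> children, String myChild) {
        long mine = sequenceOf(myChild);
        String best = null;
        long bestSeq = Long.MIN_VALUE;
        for (String child : children) {
            long seq = sequenceOf(child);
            if (seq < mine && seq > bestSeq) {
                best = child;
                bestSeq = seq;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        List<String> children = Arrays.asList("lock-1", "lock-2", "lock-3");
        System.out.println(predecessor(children, "lock-3")); // lock-2
        System.out.println(predecessor(children, "lock-1")); // null
    }
}
```

With this rule each release wakes only one waiting client, instead of all of them.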

Recoverable exceptions

Another problem with the lock algorithm as it stands is that it doesn’t handle the case

when the create operation fails due to connection loss. Recall that in this case we do

not know if the operation succeeded or failed. Creating a sequential znode is a

nonidempotent operation, so we can’t simply retry, since if the first create had

succeeded, we would have an orphaned znode that would never be deleted (until the

client session ended, at least). Deadlock would be the unfortunate result.

The problem is that after reconnecting, the client can’t tell whether it created any of

the child znodes. By embedding an identifier in the znode name, if it suffers a connection

loss, it can check to see whether any of the children of the lock node have its identifier

in their name. If a child contains its identifier, it knows that the create operation succeeded,

and it shouldn’t create another child znode. If no child has the identifier in its name, then the client can safely create a new sequential child znode.

The client’s session identifier is a long integer that is unique for the ZooKeeper service

and therefore ideal for the purpose of identifying a client across connection loss events.

The session identifier can be obtained by calling the getSessionId() method on the

ZooKeeper Java class.

The ephemeral sequential znode should be created with a name of the form lock-

<sessionId>-, so that when the sequence number is appended by ZooKeeper, the name

becomes lock-<sessionId>-<sequenceNumber>. The sequence numbers are unique to the

parent, not to the name of the child, so this technique allows the child znodes to identify

their creators as well as impose an order of creation.
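
The naming scheme above lends itself to a simple check after a connection loss. The sketch below builds the lock-<sessionId>- prefix and searches a list of child names for it. The LockNodeNames class and the sample session identifiers are invented for illustration; in a real client the identifier would come from getSessionId().

```java
import java.util.Arrays;
import java.util.List;

public class LockNodeNames {

    /** Name prefix passed to create; ZooKeeper appends the sequence number. */
    public static String prefixFor(long sessionId) {
        return "lock-" + sessionId + "-";
    }

    /** Returns this session's child znode if the earlier create was applied, otherwise null. */
    public static String findOwnChild(List<String> children, long sessionId) {
        String prefix = prefixFor(sessionId);
        for (String child : children) {
            if (child.startsWith(prefix)) {
                return child;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        List<String> children =
            Arrays.asList("lock-101-0000000004", "lock-202-0000000005");
        System.out.println(findOwnChild(children, 202L)); // lock-202-0000000005
        System.out.println(findOwnChild(children, 303L)); // null, so it is safe to create
    }
}
```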

Unrecoverable exceptions

If a client’s ZooKeeper session expires, the ephemeral znode created by the client will

be deleted, effectively relinquishing the lock or at least forfeiting the client’s turn to

acquire the lock. The application using the lock should realize that it no longer holds

the lock, clean up its state, and then start again by creating a new lock object and trying

to acquire it. Notice that it is the application that controls this process, not the lock

implementation, since it cannot second-guess how the application needs to clean up

its state.

Implementation

Implementing a distributed lock correctly is a delicate matter, since accounting for all

of the failure modes is nontrivial. ZooKeeper comes with a production-quality lock

implementation in Java called WriteLock that is very easy for clients to use.

BookKeeper and Hedwig

BookKeeper is a highly-available and reliable logging service. It can be used to provide

write-ahead logging, which is a common technique for ensuring data integrity in storage

systems. In a system using write-ahead logging, every write operation is written to the

transaction log before it is applied. Using this procedure, we don’t have to write the

data to permanent storage after every write operation because in the event of a system

failure, the latest state may be recovered by replaying the transaction log for any writes

that had not been applied.

BookKeeper clients create logs called ledgers, and each record appended to a ledger is

called a ledger entry, which is simply a byte array. Ledgers are managed by bookies,

which are servers that replicate the ledger data. Note that ledger data is not stored in

ZooKeeper, only metadata is.

Traditionally, the challenge has been to make systems that use write-ahead logging

robust in the face of failure of the node writing the transaction log. This is usually done

by replicating the transaction log in some manner. Hadoop’s HDFS namenode, for

instance, writes its edit log to multiple disks, one of which is typically an NFS mounted

disk. However, in the event of failure of the primary, failover is still manual. By providing

logging as a highly available service, BookKeeper promises to make failover

transparent, since it can tolerate the loss of bookie servers. (In the case of HDFS High-Availability, a BookKeeper-based edit log will remove the requirement

for using NFS for shared storage.)

Hedwig is a topic-based publish-subscribe system built on BookKeeper. Thanks to its

ZooKeeper underpinnings, Hedwig is a highly available service and guarantees message

delivery even if subscribers are offline for extended periods of time.

BookKeeper is a ZooKeeper subproject, and you can find more information on how to

use it, and Hedwig, at http://zookeeper.apache.org/bookkeeper/.

ZooKeeper in Production:

In production, you should run ZooKeeper in replicated mode. Here we will cover some

of the considerations for running an ensemble of ZooKeeper servers. However, this

section is not exhaustive, so you should consult the ZooKeeper Administrator’s

Guide for detailed up-to-date instructions, including supported platforms, recommended

hardware, maintenance procedures, and configuration properties.

Resilience and Performance:

ZooKeeper machines should be located to minimize the impact of machine and network

failure. In practice, this means that servers should be spread across racks, power supplies,

and switches, so that the failure of any one of these does not cause the ensemble to lose a majority of its servers.
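
The majority requirement can be expressed as simple arithmetic: an ensemble stays available while more than half of its servers are up. The small sketch below (with hypothetical names) computes the majority size and the number of failures an ensemble can absorb.

```java
public class QuorumMath {

    /** Smallest number of servers that forms a majority of an ensemble of the given size. */
    public static int majority(int ensembleSize) {
        return ensembleSize / 2 + 1;
    }

    /** Number of servers that can fail while a majority remains. */
    public static int tolerableFailures(int ensembleSize) {
        return ensembleSize - majority(ensembleSize);
    }

    public static void main(String[] args) {
        for (int n = 3; n <= 6; n++) {
            System.out.printf("ensemble=%d majority=%d tolerates=%d%n",
                n, majority(n), tolerableFailures(n));
        }
    }
}
```

A five-server ensemble tolerates two failures, so no single rack, power supply, or switch should carry three of its servers.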

For applications that require low-latency service (on the order of a few milliseconds),

it is important to run all the servers in an ensemble in a single data center. Some use

cases don’t require low-latency responses, however, which makes it feasible to spread

servers across data centers (at least two per data center) for extra resilience. Example

applications in this category are leader election and distributed coarse-grained locking,

both of which have relatively infrequent state changes so the overhead of a few tens of

milliseconds that inter-data center messages incurs is not significant to the overall

functioning of the service.

ZooKeeper is a highly available system, and it is critical that it can perform its functions

in a timely manner. Therefore, ZooKeeper should run on machines that are dedicated

to ZooKeeper alone. Having other applications contend for resources can cause ZooKeeper’s

performance to degrade significantly.

Configure ZooKeeper to keep its transaction log on a different disk drive from its snapshots.

By default, both go in the directory specified by the dataDir property, but by

specifying a location for dataLogDir, the transaction log will be written there. By having

its own dedicated device (not just a partition), a ZooKeeper server can maximize the

rate at which it writes log entries to disk, which it does sequentially, without seeking.

Since all writes go through the leader, write throughput does not scale by adding servers,

so it is crucial that writes are as fast as possible.

If the process swaps to disk, performance will be adversely affected. This can be avoided

by setting the Java heap size to less than the amount of unused physical memory on

the machine. The ZooKeeper scripts will source a file called java.env from its configuration

directory, and this can be used to set the JVMFLAGS environment variable to set

the heap size (and any other desired JVM arguments).

Configuration

Each server in the ensemble of ZooKeeper servers has a numeric identifier that is unique

within the ensemble, and must fall between 1 and 255. The server number is specified

in plain text in a file named myid in the directory specified by the dataDir property.

Setting each server number is only half of the job. We also need to give each server

the identities and network locations of all the others in the ensemble. The ZooKeeper

configuration file must include a line for each server, of the form:

server.n=hostname:port:port

The value of n is replaced by the server number. There are two port settings: the first

is the port that followers use to connect to the leader, and the second is used for leader

election. Here is a sample configuration for a three-machine replicated ZooKeeper

ensemble:

tickTime=2000

dataDir=/disk1/zookeeper

dataLogDir=/disk2/zookeeper

clientPort=2181

initLimit=5

syncLimit=2

server.1=zookeeper1:2888:3888

server.2=zookeeper2:2888:3888

server.3=zookeeper3:2888:3888

Servers listen on three ports: 2181 for client connections; 2888 for follower connections,

if they are the leader; and 3888 for other server connections during the leader election

phase. When a ZooKeeper server starts up, it reads the myid file to determine which

server it is, then reads the configuration file to determine the ports it should listen on,

as well as the network addresses of the other servers in the ensemble.
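
The startup lookup described above can be illustrated with a small Python sketch that extracts the server.n lines from a configuration file. This is not how ZooKeeper itself parses its configuration; the names Server and parse_servers are hypothetical, and the sketch handles only the hostname:port:port form shown in this section.

```python
from typing import Dict, NamedTuple


class Server(NamedTuple):
    host: str
    follower_port: int   # followers connect to the leader on this port
    election_port: int   # used during leader election


def parse_servers(config_text: str) -> Dict[int, Server]:
    """Map each server number n to its server.n=hostname:port:port entry."""
    servers: Dict[int, Server] = {}
    for raw in config_text.splitlines():
        line = raw.strip()
        if not line.startswith("server."):
            continue  # skip other properties, comments, and blank lines
        key, _, value = line.partition("=")
        server_id = int(key.strip()[len("server."):])
        host, follower_port, election_port = value.strip().split(":")
        servers[server_id] = Server(host, int(follower_port), int(election_port))
    return servers


# The sample configuration from this section.
SAMPLE = """\
tickTime=2000
dataDir=/disk1/zookeeper
dataLogDir=/disk2/zookeeper
clientPort=2181
initLimit=5
syncLimit=2
server.1=zookeeper1:2888:3888
server.2=zookeeper2:2888:3888
server.3=zookeeper3:2888:3888
"""
```

A server whose myid file contains 2 would then find its own entry with parse_servers(SAMPLE)[2] and treat the remaining entries as its peers.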

Clients connecting to this ZooKeeper ensemble should use

zookeeper1:2181,zookeeper2:2181,zookeeper3:2181 as the host string in the constructor for the ZooKeeper

object.
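
The host string is simply a comma-separated list of host:port pairs, one per server, using the client port. A minimal Python sketch of building it follows; the function name is hypothetical, and it assumes every server uses the same clientPort, as in the sample configuration.

```python
from typing import Iterable


def connect_string(hosts: Iterable[str], client_port: int = 2181) -> str:
    """Join hosts into the comma-separated host:port list clients pass in."""
    return ",".join(f"{host}:{client_port}" for host in hosts)
```

Calling connect_string(["zookeeper1", "zookeeper2", "zookeeper3"]) reproduces the host string given above.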

In replicated mode, there are two extra mandatory properties: initLimit and

syncLimit, both measured in multiples of tickTime.

initLimit is the amount of time to allow for followers to connect to and sync with the

leader. If a majority of followers fail to sync within this period, then the leader renounces

its leadership status and another leader election takes place. If this happens often (which

you can tell from the logs), it is a sign that the setting is too

low.

syncLimit is the amount of time to allow a follower to sync with the leader. If a follower

fails to sync within this period, it will restart itself. Clients that were attached to this

follower will connect to another one.
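
Because both limits are expressed in multiples of tickTime, it can help to convert them to wall-clock time. The Python sketch below does that arithmetic; the function name is hypothetical, and the values come from the sample configuration in this section (tickTime=2000, initLimit=5, syncLimit=2).

```python
from typing import Dict


def limits_in_ms(tick_time_ms: int, init_limit: int, sync_limit: int) -> Dict[str, int]:
    """Convert initLimit and syncLimit from ticks to milliseconds."""
    return {
        "initLimit": tick_time_ms * init_limit,
        "syncLimit": tick_time_ms * sync_limit,
    }


if __name__ == "__main__":
    # Sample configuration: 5 ticks and 2 ticks of 2000 ms each.
    print(limits_in_ms(2000, 5, 2))
```

With the sample values, followers have 10 seconds to connect to and sync with the leader, and 4 seconds to sync thereafter.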

These are the minimum settings needed to get up and running with a cluster of ZooKeeper

servers.

There are, however, more configuration options, particularly for tuning performance, documented in the ZooKeeper Administrator’s Guide.