Friday, May 23, 2014

Administering Hadoop

10: Administering Hadoop 

HDFS

Persistent Data Structures:

- As an administrator, it is invaluable to have a basic understanding of how the components of HDFS (the namenode, secondary namenode, and datanodes) organize their persistent data on disk.

Namenode directory structure:

- A newly formatted namenode creates the following directory structure:
- ${dfs.name.dir}/current/VERSION
                       /edits
                       /fsimage
                       /fstime
- The VERSION file is a Java properties file that contains information about the version of HDFS that is running. Here are the contents of a typical file:

#Tue Mar 10 19:21:36 GMT 2009
namespaceID=134368441
cTime=0
storageType=NAME_NODE
layoutVersion=-18

- layoutVersion is a negative integer that defines the version of HDFS's persistent data structures.
- Whenever the layout changes, the version number is decremented.
- When this happens, HDFS needs to be upgraded, since a newer namenode will not operate if its storage layout is an older version.
- namespaceID is a unique identifier for the filesystem, which is created when the filesystem is first formatted.
- The cTime property marks the creation time of the namenode's storage.
- storageType indicates that this storage directory contains data structures for a namenode.

Filesystem image and edit log:

- When a filesystem client performs a write operation, it is first recorded in the edit log.
- The namenode also has an in-memory representation of the filesystem metadata, which it updates after the edit log has been modified.
- The in-memory metadata is used to serve read requests.
- The fsimage file is a persistent checkpoint of the filesystem metadata.
- If the namenode fails, the latest state of its metadata can be reconstructed by loading the fsimage from disk into memory and then applying each of the operations in the edit log.
- Left unchecked, the edits file would grow without bound.
- If the namenode were restarted, it would take a long time to apply each of the operations in its edit log.
- During this time, the filesystem would be offline, which is generally undesirable.
- The solution is to run the secondary namenode, whose purpose is to produce checkpoints of the primary's in-memory filesystem metadata.
- The checkpointing process proceeds as follows:
1. The secondary asks the primary to roll its edits file, so new edits go to a new file.
2. The secondary retrieves fsimage and edits from the primary (using HTTP GET).
3. The secondary loads fsimage into memory, applies each operation from edits, then creates a new consolidated fsimage file.
4. The secondary sends the new fsimage back to the primary (using HTTP POST).
5. The primary replaces the old fsimage with the new one from the secondary, and the old edits file with the new one it started in step 1.

- At the end of the process, the primary has an up-to-date fsimage file and a shorter edits file.
- It is possible for an administrator to run this process manually while the namenode is in safe mode, using the hadoop dfsadmin -saveNamespace command.

- This procedure makes it clear why the secondary has similar memory requirements to the primary (since it loads the fsimage into memory), which is the reason that the secondary needs a dedicated machine on large clusters.
- The schedule for checkpointing is controlled by two configuration parameters: the secondary namenode checkpoints every hour (fs.checkpoint.period, in seconds), or sooner if the edit log has reached 64 MB (fs.checkpoint.size, in bytes), which it checks every five minutes.
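
For reference, these parameters can be set explicitly in the Hadoop configuration. The snippet below is a sketch that simply restates the defaults described above (in Hadoop 1.x these properties are typically placed in core-site.xml, but check your version's documentation):

<property>
  <name>fs.checkpoint.period</name>
  <value>3600</value>   <!-- checkpoint every hour (seconds) -->
</property>
<property>
  <name>fs.checkpoint.size</name>
  <value>67108864</value>   <!-- or sooner, once edits reaches 64 MB (bytes) -->
</property>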

Secondary namenode directory structure

- A useful side effect of the checkpointing process is that the secondary has a checkpoint at the end of the process, which can be found in a subdirectory called previous.checkpoint.
- This can be used as a source for making backups of the namenode's metadata:
${fs.checkpoint.dir}/current/VERSION
                            /edits
                            /fsimage
                            /fstime
                    /previous.checkpoint/VERSION
                                        /edits
                                        /fsimage
                                        /fstime

- The layout of this directory and of the secondary's current directory is identical to the namenode's.
- In the event of total namenode failure, this allows recovery from the secondary namenode.
- This can be achieved either by copying the relevant storage directory to a new namenode, or, if the secondary is taking over as the new primary namenode, by using the -importCheckpoint option when starting the namenode daemon.
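
As a sketch of the second approach, the recovery could look like the following, assuming the secondary's checkpoint directory has been made available on the new namenode machine, fs.checkpoint.dir points at it, and dfs.name.dir is empty:

% hadoop namenode -importCheckpoint

The namenode loads the checkpointed fsimage and edits into its own storage directory and then continues running as normal.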

Datanode directory structure:

- Datanodes do not need to be explicitly formatted, since they create their storage directories automatically on startup.
- Below are the key files and directories:
${dfs.data.dir}/current/VERSION
                       /blk_<id_1>
                       /blk_<id_1>.meta
                       /blk_<id_2>
                       /blk_<id_2>.meta
                       /...
                       /blk_<id_64>
                       /blk_<id_64>.meta
                       /subdir0/
                       /subdir1/
                       /...
                       /subdir63/ 

- The datanode's VERSION file is very similar to the namenode's:
#Tue Mar 10 21:32:31 GMT 2009
namespaceID=134368441
storageID=DS-547717739-172.16.85.1-50010-1236720751627
cTime=0
storageType=DATA_NODE
layoutVersion=-18

Safe Mode:

- When the namenode starts, the first thing it does is load its image file (fsimage) into memory and apply the edits from the edit log (edits).
- Once it has reconstructed a consistent in-memory image of the filesystem metadata, it creates a new fsimage file and an empty edit log.
- Only at this point does the namenode start listening for RPC and HTTP requests.
- However, the namenode is running in safe mode, which means that it offers only a read-only view of the filesystem to clients.
- Safe mode is needed to give the datanodes time to check in to the namenode with their block lists, so the namenode can be informed of enough block locations to run the filesystem effectively.
- If the namenode didn't wait for enough datanodes to check in, it would start the process of replicating blocks to new datanodes.
- While in safe mode, the namenode does not issue any block replication or deletion instructions to datanodes.
- Safe mode is exited when the minimal replication condition is reached, plus an extension time of 30 seconds.
- The minimal replication condition is when 99.9% of the blocks in the whole filesystem meet their minimum replication level.
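
These thresholds correspond to configuration properties; the sketch below restates the usual defaults (property names as in Hadoop 1.x, so verify them against your release):

<property>
  <name>dfs.safemode.threshold.pct</name>
  <value>0.999</value>   <!-- fraction of blocks that must meet minimal replication -->
</property>
<property>
  <name>dfs.safemode.extension</name>
  <value>30000</value>   <!-- extension time in milliseconds (30 seconds) -->
</property>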

Entering and leaving safe mode:

- To see whether the namenode is in safe mode, you can use the dfsadmin command:
% hadoop dfsadmin -safemode get
Safe mode is ON

- Sometimes you want to wait for the namenode to exit safe mode before carrying out a command, particularly in scripts. The wait option achieves this:

% hadoop dfsadmin -safemode wait
# command to read or write a file

To enter safe mode:

% hadoop dfsadmin -safemode enter
Safe mode is ON

To leave safe mode:

% hadoop dfsadmin -safemode leave
Safe mode is OFF

Audit Logging:

- Audit logging is performed using log4j logging at the INFO level; in the default configuration it is disabled, because the log threshold is set to WARN in log4j.properties:

log4j.logger.org.apache.hadoop.hdfs.server.namenode.FSNamesystem.audit=WARN

- You can enable audit logging by replacing WARN with INFO.
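
That is, the edited line in log4j.properties reads:

log4j.logger.org.apache.hadoop.hdfs.server.namenode.FSNamesystem.audit=INFO

Each HDFS operation is then written to the namenode's log as an audit record like the one below.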

2009-03-13 07:11:22,982 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem.
 audit: ugi=tom,staff,admin  ip=/127.0.0.1  cmd=listStatus  src=/user/tom  dst=null
 perm=null         

Tools:

dfsadmin

- A multipurpose tool for finding information about the state of HDFS, as well as for performing administration operations on HDFS.
- It is invoked as hadoop dfsadmin and requires superuser privileges.

Filesystem check (fsck)

- Checks the health of files in HDFS. The tool looks for blocks that are missing from all datanodes, as well as under- or over-replicated blocks.
- Below is an example of checking the whole filesystem for a small cluster:

% hadoop fsck /
......................Status: HEALTHY
 Total size:    511799225 B
 Total dirs:    10
 Total files:    22
 Total blocks (validated):    22 (avg. block size 23263601 B)
 Minimally replicated blocks:    22 (100.0 %)
 Over-replicated blocks:    0 (0.0 %)
 Under-replicated blocks:    0 (0.0 %)
 Mis-replicated blocks:        0 (0.0 %)
 Default replication factor:    3
 Average block replication:    3.0
 Corrupt blocks:        0
 Missing replicas:        0 (0.0 %)
 Number of data-nodes:        4
 Number of racks:        1

- fsck retrieves all of its information from the namenode; it does not communicate with any datanodes to actually retrieve block data.

Most of the output from fsck is self-explanatory, but here are some of the conditions it
looks for:

Over-replicated blocks

These are blocks that exceed their target replication for the file they belong to.
Over-replication is not normally a problem, and HDFS will automatically delete
excess replicas.

Under-replicated blocks

These are blocks that do not meet their target replication for the file they belong
to. HDFS will automatically create new replicas of under-replicated blocks until
they meet the target replication. You can get information about the blocks being
replicated (or waiting to be replicated) using hadoop dfsadmin -metasave.

Misreplicated blocks

These are blocks that do not satisfy the block replica placement policy (see “Replica
Placement” on page 74). For example, for a replication level of three in a multirack
cluster, if all three replicas of a block are on the same rack, then the block is misreplicated
since the replicas should be spread across at least two racks for
resilience. HDFS will automatically re-replicate misreplicated blocks so that they
satisfy the rack placement policy.

Corrupt blocks

These are blocks whose replicas are all corrupt. Blocks with at least one noncorrupt
replica are not reported as corrupt; the namenode will replicate the noncorrupt
replica until the target replication is met.

Missing replicas

These are blocks with no replicas anywhere in the cluster.

- Corrupt or missing blocks are the biggest cause for concern, as they mean data has been lost.
- By default, fsck leaves files with corrupt or missing blocks alone, but you can tell it to perform one of the following actions on them:

• Move the affected files to the /lost+found directory in HDFS, using the -move option.
Files are broken into chains of contiguous blocks to aid any salvaging efforts you
may attempt.
• Delete the affected files, using the -delete option. Files cannot be recovered after
being deleted.

Finding blocks for a file:

- The fsck tool provides an easy way to find out which blocks are in any particular file:

% hadoop fsck /user/tom/part-00007 -files -blocks -racks
/user/tom/part-00007 25582428 bytes, 1 block(s):  OK
0. blk_-3724870485760122836_1035 len=25582428 repl=3 [/default-rack/10.251.43.2:50010,
/default-rack/10.251.27.178:50010, /default-rack/10.251.123.163:50010]
This says that the file /user/tom/part-00007 is made up of one block and shows the
datanodes where the blocks are located. The fsck options used are as follows:
• The -files option shows the line with the filename, size, number of blocks, and
its health (whether there are any missing blocks).
• The -blocks option shows information about each block in the file, one line per
block.
• The -racks option displays the rack location and the datanode addresses for each
block.
Running hadoop fsck without any arguments displays full usage instructions.

Datanode block scanner:

- Every datanode runs a block scanner, which periodically verifies all the blocks stored
on the datanode. This allows bad blocks to be detected and fixed before they are read
by clients. The DataBlockScanner maintains a list of blocks to verify and scans them one
by one for checksum errors. The scanner employs a throttling mechanism to preserve
disk bandwidth on the datanode.
- Blocks are periodically verified every three weeks to guard against disk errors over time
(this is controlled by the dfs.datanode.scan.period.hours property, which defaults to
504 hours). Corrupt blocks are reported to the namenode to be fixed.
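
If you wanted to change the scan period, a sketch of the setting in hdfs-site.xml (using the property named above, with the default value shown) would be:

<property>
  <name>dfs.datanode.scan.period.hours</name>
  <value>504</value>   <!-- default: verify every block every three weeks -->
</property>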

- We can get a block verification report for a datanode by visiting the datanode's web interface at http://datanode:50075/blockScannerReport. Here's an example of a report, which should be self-explanatory:

Total Blocks                 :  21131
Verified in last hour        :     70
Verified in last day         :   1767
Verified in last week        :   7360
Verified in last four weeks  :  20057
Verified in SCAN_PERIOD      :  20057
Not yet verified             :   1074
Verified since restart       :  35912
Scans since restart          :   6541
Scan errors since restart    :      0
Transient scan errors        :      0
Current scan rate limit KBps :   1024
Progress this period         :    109%
Time left in cur period      :  53.08%

Here is a snippet of the block list (lines are split to fit the
page):
blk_6035596358209321442    : status : ok     type : none   scan time : 0              
    not yet verified
blk_3065580480714947643    : status : ok     type : remote scan time : 1215755306400  
    2008-07-11 05:48:26,400
blk_8729669677359108508    : status : ok     type : local  scan time : 1215755727345  
    2008-07-11 05:55:27,345

Balancer:

- Over time, the distribution of blocks across datanodes can become unbalanced. An unbalanced cluster can affect locality for MapReduce, and it puts greater strain on highly utilized datanodes, so it's best avoided.
- The balancer program is a Hadoop daemon that redistributes blocks by moving them from over-utilized datanodes to under-utilized datanodes, while adhering to the block replica placement policy that makes data loss unlikely by placing block replicas on different racks.
- You can start balancer with:
% start-balancer.sh
- Here is the output from a short run on a small cluster:
Time Stamp      Iteration#  Bytes Already Moved  Bytes Left To Move  Bytes Being Moved
Mar 18, 2009 5:23:42 PM  0                 0 KB           219.21 MB          150.29 MB
Mar 18, 2009 5:27:14 PM  1            195.24 MB            22.45 MB          150.29 MB
The cluster is balanced. Exiting...
Balancing took 6.072933333333333 minutes

The balancer is designed to run in the background without unduly taxing the cluster
or interfering with other clients using the cluster. It limits the bandwidth that it uses to
copy a block from one node to another. The default is a modest 1 MB/s, but this can
be changed by setting the dfs.balance.bandwidthPerSec property in hdfs-site.xml, specified
in bytes.
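
For example, to raise the limit to 10 MB/s, a sketch of the hdfs-site.xml setting would be (the value is in bytes per second):

<property>
  <name>dfs.balance.bandwidthPerSec</name>
  <value>10485760</value>   <!-- 10 MB/s -->
</property>

Since the datanodes enforce this limit when moving blocks, the setting needs to be in place across the cluster, not just on the machine running the balancer.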

Monitoring:

- The purpose of monitoring is to detect when the cluster is not providing the expected level of service.
- The master daemons are the most important to monitor: the namenodes and the jobtracker.
- Failure of datanodes and tasktrackers is to be expected, particularly on large clusters.

Logging:

Setting log levels:

- When debugging a problem, it is very convenient to be able to change the log level temporarily for a particular component in the system.
- For example, to enable debug logging for the JobTracker class, we would visit the jobtracker’s
web UI at http://jobtracker-host:50030/logLevel and set the log name
org.apache.hadoop.mapred.JobTracker to level DEBUG.
The same thing can be achieved from the command line as follows:
% hadoop daemonlog -setlevel jobtracker-host:50030 \
  org.apache.hadoop.mapred.JobTracker DEBUG

- To make a persistent change to the log level, simply change the log4j.properties file in the configuration directory.
- In this case, the line to add is:
log4j.logger.org.apache.hadoop.mapred.JobTracker=DEBUG

Getting stack traces:

- Hadoop daemons expose a web page (/stacks in the web UI) that produces a thread
dump for all running threads in the daemon’s JVM. For example, you can get a thread
dump for a jobtracker from http://jobtracker-host:50030/stacks.

Metrics:

- HDFS and MapReduce daemons collect information about events and measurements that are collectively known as metrics.
- For example, datanodes collect the following metrics: number of bytes written, number of blocks replicated, and number of read requests from clients (both local and remote).
- Metrics belong to a context, and Hadoop currently uses the dfs, mapred, rpc, and jvm contexts.
- Hadoop daemons usually collect metrics under several contexts. For example, datanodes collect metrics for the dfs, rpc, and jvm contexts.

How do Metrics differ from Counters?

- Metrics are collected by Hadoop daemons, whereas counters are collected for MapReduce tasks and aggregated for the whole job.
- Metrics are for administrators; counters are for MapReduce users.
- A context defines the unit of publication: you can choose to publish the dfs context but not the jvm context.
- Metrics are configured in the conf/hadoop-metrics.properties file and, by default, all contexts are configured so they do not publish their metrics.
- Here are the contents of the default configuration file:

dfs.class=org.apache.hadoop.metrics.spi.NullContext
mapred.class=org.apache.hadoop.metrics.spi.NullContext
jvm.class=org.apache.hadoop.metrics.spi.NullContext
rpc.class=org.apache.hadoop.metrics.spi.NullContext

Each line in this file configures a different context and specifies the class that handles
the metrics for that context. The class must be an implementation of the MetricsContext
interface, and, as the name suggests, the NullContext class neither publishes nor
updates metrics.

FileContext:

- Writes metrics to a local file.
- It exposes two configuration properties: fileName, which specifies the absolute name of the file to write to, and period, the time interval between file updates.

 For example, to dump the “jvm” context to a file, we alter its configuration to be the following:
jvm.class=org.apache.hadoop.metrics.file.FileContext
jvm.fileName=/tmp/jvm_metrics.log

 Here are two lines of output from the logfile, split over several lines to fit the page:

jvm.metrics: hostName=ip-10-250-59-159, processName=NameNode, sessionId=, ↵
gcCount=46, gcTimeMillis=394, logError=0, logFatal=0, logInfo=59, logWarn=1, ↵
memHeapCommittedM=4.9375, memHeapUsedM=2.5322647, memNonHeapCommittedM=18.25, ↵
memNonHeapUsedM=11.330269, threadsBlocked=0, threadsNew=0, threadsRunnable=6, ↵
threadsTerminated=0, threadsTimedWaiting=8, threadsWaiting=13
jvm.metrics: hostName=ip-10-250-59-159, processName=SecondaryNameNode, sessionId=, ↵
gcCount=36, gcTimeMillis=261, logError=0, logFatal=0, logInfo=18, logWarn=4, ↵
memHeapCommittedM=5.4414062, memHeapUsedM=4.46756, memNonHeapCommittedM=18.25, ↵
memNonHeapUsedM=10.624519, threadsBlocked=0, threadsNew=0, threadsRunnable=5, ↵
threadsTerminated=0, threadsTimedWaiting=4, threadsWaiting=2

GangliaContext:

- Ganglia is an open source distributed monitoring system for very large clusters.
- Ganglia itself collects metrics, such as CPU and memory usage; by using GangliaContext, you can inject Hadoop metrics into Ganglia.
- GangliaContext has one required property, servers, which takes a space- and/or
comma-separated list of Ganglia server host-port pairs. Further details on configuring
this context can be found on the Hadoop wiki.
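
A sketch of a hadoop-metrics.properties entry that publishes the dfs context to Ganglia (the host and port here are placeholders):

dfs.class=org.apache.hadoop.metrics.ganglia.GangliaContext
dfs.servers=ganglia-host:8649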

CompositeContext:

- Allows you to output the same set of metrics to multiple contexts, such as a FileContext and a GangliaContext.
- The configuration is slightly tricky and is best shown by an example:
jvm.class=org.apache.hadoop.metrics.spi.CompositeContext
jvm.arity=2
jvm.sub1.class=org.apache.hadoop.metrics.file.FileContext
jvm.fileName=/tmp/jvm_metrics.log
jvm.sub2.class=org.apache.hadoop.metrics.ganglia.GangliaContext
jvm.servers=ip-10-250-59-159.ec2.internal:8649

- arity specifies the number of subcontexts; in this case, there are two.
- The property names for each subcontext are modified to have a part specifying the subcontext number, hence jvm.sub1.class and jvm.sub2.class.

Java Management Extensions:

- JMX is the standard Java API for monitoring and managing applications.
- Hadoop includes several managed beans (MBeans), which expose Hadoop metrics to JMX-aware applications.
- Here is an example of using the "jmxquery" command-line tool (and Nagios plug-in, available from http://code.google.com/p/jmxquery/) to retrieve the number of under-replicated blocks:

% ./check_jmx -U service:jmx:rmi:///jndi/rmi://namenode-host:8004/jmxrmi -O \
hadoop:service=NameNode,name=FSNamesystemState -A UnderReplicatedBlocks \
-w 100 -c 1000 -username monitorRole -password secret
JMX OK - UnderReplicatedBlocks is 0
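
For a remote query like this to work, the namenode's JVM must have remote JMX enabled. A sketch of how this might be done in hadoop-env.sh is shown below; the port matches the example above, but the password-file location is an assumption for illustration:

export HADOOP_NAMENODE_OPTS="-Dcom.sun.management.jmxremote \
  -Dcom.sun.management.jmxremote.port=8004 \
  -Dcom.sun.management.jmxremote.authenticate=true \
  -Dcom.sun.management.jmxremote.password.file=$HADOOP_CONF_DIR/jmxremote.password \
  -Dcom.sun.management.jmxremote.ssl=false $HADOOP_NAMENODE_OPTS"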

Maintenance:

Routine Administration Procedures:

Metadata Backups:

- A straightforward way to make backups is to write a script to periodically archive the
secondary namenode’s previous.checkpoint subdirectory (under the directory defined
by the fs.checkpoint.dir property) to an offsite location. The script should additionally
test the integrity of the copy. This can be done by starting a local namenode daemon
and verifying that it has successfully read the fsimage and edits files into memory (by
scanning the namenode log for the appropriate success message, for example).
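
A minimal sketch of such a script is shown below; the checkpoint directory, backup host, and backup path are placeholders that you would replace with your own fs.checkpoint.dir and offsite location:

#!/usr/bin/env bash
# Archive the secondary namenode's last checkpoint to an offsite host.
CHECKPOINT_DIR=/data/hadoop/secondarynamenode/previous.checkpoint   # under fs.checkpoint.dir
BACKUP_HOST=backup-host                                             # offsite machine
BACKUP_PATH=/backups/namenode-metadata

STAMP=$(date +%Y%m%d-%H%M%S)
ARCHIVE=/tmp/namenode-meta-$STAMP.tar.gz

tar czf "$ARCHIVE" -C "$(dirname "$CHECKPOINT_DIR")" previous.checkpoint
scp "$ARCHIVE" "$BACKUP_HOST:$BACKUP_PATH/"
rm -f "$ARCHIVE"

Run from cron, this gives regular offsite copies; verifying each copy by starting a throwaway namenode against it, as described above, is still recommended.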

Data backups:

- Although HDFS is designed to store data reliably, data loss can occur, just like in any
storage system, and thus a backup strategy is essential. With the large data volumes
that Hadoop can store, deciding what data to back up and where to store it is a challenge.
The key here is to prioritize your data. 
The highest priority is the data that cannot be regenerated and that is critical to the business; however, data that is straightforward to regenerate, or essentially disposable because it is of limited business value, is the
lowest priority, and you may choose not to make backups of this category of data.

The distcp tool is ideal for making backups to other HDFS clusters (preferably running on a different version of the software, to guard against loss due to bugs in HDFS) or other Hadoop filesystems (such as S3 or KFS), since it can copy files in parallel. 
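
A sketch of such a backup with distcp (the hosts and paths are placeholders; hftp gives a read-only, version-independent way to read from the source cluster, and the command should be run on the destination cluster):

% hadoop distcp hftp://namenode1:50070/user/important hdfs://namenode2/backups/user/important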

Commissioning and Decommissioning Nodes:

As an administrator of a Hadoop cluster, you will need to add or remove nodes from time to time. For example, to grow the storage available to a cluster, you commission new nodes. Conversely, sometimes you may wish to shrink a cluster, and to do so, you decommission nodes. It can sometimes be necessary to decommission a node if it is misbehaving, perhaps because it is failing more often than it should or its performance is noticeably slow.
Nodes normally run both a datanode and a tasktracker, and both are typically commissioned or decommissioned in tandem.

Commissioning new nodes:

- Commissioning a new node can be as simple as configuring the hdfs-site.xml file to point to the namenode and the mapred-site.xml file to point to the jobtracker, and then starting the datanode and tasktracker daemons; even so, it is best to maintain an explicit list of the nodes that are permitted to connect.
- Datanodes that are permitted to connect to the namenode are specified in a file whose
name is specified by the dfs.hosts property. The file resides on the namenode’s local
filesystem, and it contains a line for each datanode, specified by network address (as
reported by the datanode—you can see what this is by looking at the namenode’s web
UI). If you need to specify multiple network addresses for a datanode, put them on one
line, separated by whitespace.

Similarly, tasktrackers that may connect to the jobtracker are specified in a file whose
name is specified by the mapred.hosts property. In most cases, there is one shared file,
referred to as the include file, that both dfs.hosts and mapred.hosts refer to, since nodes
in the cluster run both datanode and tasktracker daemons.
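
A sketch of the relevant configuration, assuming a single shared include file at /etc/hadoop/conf/include (the path is a placeholder):

<!-- hdfs-site.xml -->
<property>
  <name>dfs.hosts</name>
  <value>/etc/hadoop/conf/include</value>
</property>

<!-- mapred-site.xml -->
<property>
  <name>mapred.hosts</name>
  <value>/etc/hadoop/conf/include</value>
</property>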

To add new nodes to the cluster:
1. Add the network addresses of the new nodes to the include file.
2. Update the namenode with the new set of permitted datanodes using this
command:
% hadoop dfsadmin -refreshNodes
3. Update the jobtracker with the new set of permitted tasktrackers using:
% hadoop mradmin -refreshNodes
4. Update the slaves file with the new nodes, so that they are included in future operations
performed by the Hadoop control scripts.
5. Start the new datanodes and tasktrackers.
6. Check that the new datanodes and tasktrackers appear in the web UI.
HDFS will not move blocks from old datanodes to new datanodes to balance the cluster; to do this, you should run the balancer described earlier.

Decommissioning Old nodes:

- The decommissioning process is controlled by an exclude file, which for HDFS is set
by the dfs.hosts.exclude property and for MapReduce by the mapred.hosts.exclude
property. It is often the case that these properties refer to the same file. The exclude file
lists the nodes that are not permitted to connect to the cluster.
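
As with the include file, a sketch of the configuration, assuming a single shared exclude file (the path is a placeholder):

<!-- hdfs-site.xml -->
<property>
  <name>dfs.hosts.exclude</name>
  <value>/etc/hadoop/conf/exclude</value>
</property>

<!-- mapred-site.xml -->
<property>
  <name>mapred.hosts.exclude</name>
  <value>/etc/hadoop/conf/exclude</value>
</property>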

To remove nodes from the cluster:
1. Add the network addresses of the nodes to be decommissioned to the exclude file.
Do not update the include file at this point.
2. Update the namenode with the new set of permitted datanodes, with this
command:
% hadoop dfsadmin -refreshNodes
3. Update the jobtracker with the new set of permitted tasktrackers using:
% hadoop mradmin -refreshNodes
4. Go to the web UI and check whether the admin state has changed to "Decommission In Progress" for the datanodes being decommissioned.
5. When all the datanodes report their state as “Decommissioned,” then all the blocks
have been replicated. Shut down the decommissioned nodes.
6. Remove the nodes from the include file, and run:
% hadoop dfsadmin -refreshNodes
% hadoop mradmin -refreshNodes
7. Remove the nodes from the slaves file.

Upgrades:

- Upgrading an HDFS and MapReduce cluster requires careful planning. The most important
consideration is the HDFS upgrade. If the layout version of the filesystem has changed, then the upgrade
will automatically migrate the filesystem data and metadata to a format that is compatible with the new version. As with any procedure that involves data migration, there is a risk of data loss, so you should be sure that both your data and metadata are backed up.
Part of the planning process should include a trial run on a small test cluster with a copy of data that you can afford to lose. A trial run will allow you to familiarize yourself with the process, customize it to your particular cluster configuration and toolset, and iron out any snags before running the upgrade procedure on a Production cluster. A test cluster also has the benefit of being available to test client upgrades on. 
Upgrading a cluster when the filesystem layout has not changed is fairly
straightforward: install the new versions of HDFS and MapReduce on the cluster (and
on clients at the same time), shut down the old daemons, update configuration files,
then start up the new daemons and switch clients to use the new libraries. This process
is reversible, so rolling back an upgrade is also straightforward.
After every successful upgrade, you should perform a couple of final cleanup steps:
• Remove the old installation and configuration files from the cluster.
• Fix any deprecation warnings in your code and configuration.

HDFS data and metadata upgrades:

If you use the procedure just described to upgrade to a new version of HDFS and it
expects a different layout version, then the namenode will refuse to run. A message like
the following will appear in its log:
File system image contains an old layout version -16.
An upgrade to version -18 is required.
Please restart NameNode with -upgrade option.
The most reliable way of finding out whether you need to upgrade the filesystem is by
performing a trial on a test cluster.
An upgrade of HDFS makes a copy of the previous version’s metadata and data. Doing
an upgrade does not double the storage requirements of the cluster, as the datanodes
use hard links to keep two references (for the current and previous version) to the same
block of data. This design makes it straightforward to roll back to the previous version
of the filesystem, should you need to. You should understand that any changes made
to the data on the upgraded system will be lost after the rollback completes.
You can keep only the previous version of the filesystem: you can’t roll back several
versions. Therefore, to carry out another upgrade to HDFS data and metadata, you will
need to delete the previous version, a process called finalizing the upgrade. Once an
upgrade is finalized, there is no procedure for rolling back to a previous version.
In general, you can skip releases when upgrading (for example, you can upgrade from
release 0.18.3 to 0.20.0 without having to upgrade to a 0.19.x release first), but in some
cases, you may have to go through intermediate releases. The release notes make it clear
when this is required.
You should only attempt to upgrade a healthy filesystem. Before running the upgrade,
do a full fsck (see “Filesystem check (fsck)” on page 345). As an extra precaution, you
can keep a copy of the fsck output that lists all the files and blocks in the system, so
you can compare it with the output of running fsck after the upgrade.
It’s also worth clearing out temporary files before doing the upgrade, both from the
MapReduce system directory on HDFS and local temporary files.

With these preliminaries out of the way, here is the high-level procedure for upgrading
a cluster when the filesystem layout needs to be migrated:
1. Make sure that any previous upgrade is finalized before proceeding with another
upgrade.
2. Shut down MapReduce and kill any orphaned task processes on the tasktrackers.
3. Shut down HDFS and back up the namenode directories.
4. Install new versions of Hadoop HDFS and MapReduce on the cluster and on
clients.
5. Start HDFS with the -upgrade option.
6. Wait until the upgrade is complete.
7. Perform some sanity checks on HDFS.
8. Start MapReduce.
9. Roll back or finalize the upgrade (optional).
While running the upgrade procedure, it is a good idea to remove the Hadoop scripts
from your PATH environment variable. This forces you to be explicit about which version
of the scripts you are running. It can be convenient to define two environment variables
that point to the old and new installation directories; in the following instructions, we have defined
OLD_HADOOP_INSTALL and NEW_HADOOP_INSTALL.

Start the upgrade.

To perform the upgrade, run the following command (this is step 5 in
the high-level upgrade procedure):
% $NEW_HADOOP_INSTALL/bin/start-dfs.sh -upgrade
This causes the namenode to upgrade its metadata, placing the previous version in a
new directory called previous:
${dfs.name.dir}/current/VERSION
                       /edits
                       /fsimage
                       /fstime
               /previous/VERSION
                        /edits
                        /fsimage
                        /fstime
Similarly, datanodes upgrade their storage directories, preserving the old copy in a
directory called previous.
Wait until the upgrade is complete.

The upgrade process is not instantaneous, but you can check the progress of an upgrade using dfsadmin (step 6); upgrade events also appear in the daemons' logfiles:
% $NEW_HADOOP_INSTALL/bin/hadoop dfsadmin -upgradeProgress status
Upgrade for version -18 has been completed.
Upgrade is not finalized.
Check the upgrade.

This shows that the upgrade is complete. At this stage, you should run some sanity checks (step 7) on the filesystem (check files and blocks using fsck, and try some basic file operations). You might choose to put HDFS into safe mode while you are running some of these checks (the ones that are read-only) to prevent others from making changes.
Roll back the upgrade (optional).

If you find that the new version is not working correctly, you may choose to roll back to the previous version (step 9). This is only possible if you have not finalized the upgrade.
First, shut down the new daemons:
% $NEW_HADOOP_INSTALL/bin/stop-dfs.sh
Then start up the old version of HDFS with the -rollback option:
% $OLD_HADOOP_INSTALL/bin/start-dfs.sh -rollback

This command gets the namenode and datanodes to replace their current storage
directories with their previous copies. The filesystem will be returned to its previous
state.
Finalize the upgrade (optional).

When you are happy with the new version of HDFS, you can finalize the upgrade (step 9) to remove the previous storage directories.

This step is required before performing another upgrade:
% $NEW_HADOOP_INSTALL/bin/hadoop dfsadmin -finalizeUpgrade
% $NEW_HADOOP_INSTALL/bin/hadoop dfsadmin -upgradeProgress status
There are no upgrades in progress.
HDFS is now fully upgraded to the new version.


