Rev. 3.
The purpose of the Hadoop DFS upgrade is to safely transition the DFS cluster to a new software version. First, the upgrade takes a snapshot of the previous state of the file system to protect it from errors in the new software and from administrator mistakes during the upgrade process. Second, the old data is converted to the new format (if necessary), making it available to the new software.
Conversion is part of the Hadoop source code; it is performed automatically whether or not a snapshot of the old data is taken. Without a snapshot, the conversion is irreversible.
Taking a snapshot is a choice made by the system administrator to prevent data loss. It gives the administrator an opportunity to reverse modifications made during or after the upgrade.
A transition of the file system state is an operation (upgrade, rollback, or discard) that maintains file system states.
The software upgrade procedure includes the following steps:
The upgrade starts when the cluster is started with the -upgrade option.
bin/upgrade-dfs -upgrade <oldFSSID>
The -upgrade option means
If for any reason the upgrade procedure fails or the result is not satisfactory, the old data can be restored using the rollback command.
bin/upgrade-dfs -rollback <oldFSSID>
The rollback removes the current state and replaces it with the old file system state specified by <oldFSSID>.
If the old version is no longer needed, it can be discarded using
bin/upgrade-dfs -discard <oldFSSID>
The discard command removes the specified old file system state.
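The three commands above can be read as operations on a current state plus a set of named snapshots. The following is a minimal in-memory sketch of that reading, not the actual implementation. The class and method names are hypothetical. It also makes two assumptions the text does not state: that reusing an existing FSSID is an error, and that a rollback consumes the snapshot it restores.

```python
class FileSystemStates:
    """Toy model: one current state plus snapshots keyed by FSSID (hypothetical)."""

    def __init__(self, current):
        self.current = current   # the live file system state
        self.snapshots = {}      # FSSID -> preserved old state

    def upgrade(self, old_fssid, convert=lambda state: state):
        """Preserve the current state under old_fssid, then convert it."""
        if old_fssid in self.snapshots:
            # Assumption: an FSSID names exactly one snapshot.
            raise ValueError(f"snapshot {old_fssid!r} already exists")
        self.snapshots[old_fssid] = self.current
        self.current = convert(self.current)

    def rollback(self, old_fssid):
        """Remove the current state and replace it with the named snapshot."""
        # Assumption: the snapshot is consumed by the rollback.
        self.current = self.snapshots.pop(old_fssid)

    def discard(self, old_fssid):
        """Remove the named snapshot; the current state is left untouched."""
        del self.snapshots[old_fssid]


fs = FileSystemStates({"layout": 1})
fs.upgrade("before-upgrade", convert=lambda s: {**s, "layout": 2})
fs.rollback("before-upgrade")   # fs.current is the preserved old state again
```

In this model, discard after a successful upgrade simply drops the preserved state, after which a rollback to that FSSID is no longer possible.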
For the cluster, a transition consists of two major stages: the local state transition and the distributed stage. During the local stage, servers (name- and data-nodes) perform local disk operations, that is, they rename, create, or remove local files and directories. If an operation fails during the local stage, the original local state of the directories can be restored using the recover command.
bin/upgrade-dfs -recover
For name-nodes the recovery is manual, since we would like to be able to recover the namespace state without actually starting the cluster. Data-nodes recover automatically during startup, since in general they do not know about transitions until the name-node requests one.
HDFS supports several internal versions: the Hadoop build version (BV), RPC protocol versions, and the storage layout version (
FSSID is not a software version. It reflects a particular file system state. FSSID is a name for a file system data snapshot, assigned to the snapshot and maintained by the cluster administrator. DFS reports available FSSIDs via the web UI and uses them as references to the preserved file system states.
FSSID is introduced to let the administrator create a snapshot whenever desired, independently of other version changes, even without any changes to the software at all.
The upgrade is mandatory if the new software increments the
The namespace directory root is defined by the <dfs.name.dir> configuration parameter. The current directory structure is:
<dfs.name.dir>
image // namespace directory
fsimage // file containing latest checkpoint
time // file containing latest checkpoint timestamp
edits // file containing namespace edits since last checkpoint
I propose the following changes to the file layout:
The version file is going to be a common part of the persistent data layout for name-nodes and data-nodes. With that approach we will be able to re-use code related to versioning, identification, and locking.
With the changes above we will be able to accumulate namespace directories for different versions under the same root:
<dfs.name.dir>
    data                // directory for current version
        version         // the version file
        fsimage         // file containing latest checkpoint
        edits           // file containing namespace edits
    data-<
        version
        fsimage
        edits
    data-<
        version
        fsimage
        edits
The data-node root directories are defined by the <dfs.data.dir> configuration parameter, which should be verified to be different from <dfs.name.dir> at startup.
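The startup check mentioned above could look roughly like the following sketch. The function name is hypothetical, and comparing resolved paths is an assumption about what "different" should mean; the source does not say how the comparison is made.

```python
import os


def check_distinct_roots(name_dir, data_dirs):
    """Raise ValueError if any data directory resolves to the name directory.

    name_dir stands for the <dfs.name.dir> value and data_dirs for the
    <dfs.data.dir> entries; both are plain path strings in this sketch.
    """
    name_real = os.path.realpath(name_dir)
    for data_dir in data_dirs:
        # Assumption: paths are compared after resolving symlinks and "..".
        if os.path.realpath(data_dir) == name_real:
            raise ValueError(
                f"data directory {data_dir!r} must differ from name directory {name_dir!r}"
            )
```

Resolving the paths first catches the case where the two parameters are spelled differently but point at the same directory.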
Data-node blocks can be stored in multiple directories, typically located on different disk volumes. The structure of each directory is:
<dfs.data.dir[i]>
    data                // directory containing blocks
        …blocks…        // block files and subdirectories
    storage             // file containing version and storage id
I propose to replace the storage file with the version file and to place the latter inside the data directory. The version file's fields in this case are:
· string “DataNode”;
· data storage version;
· storageID – a unique identifier of the data storage within the cluster.
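As an illustration of the three fields listed above, here is a small sketch of reading and writing such a version file. The on-disk format is not specified in this document, so the key=value text layout, the key names, and the integer type of the storage version are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class VersionFile:
    node_type: str        # the string "DataNode" for a data-node
    storage_version: int  # data storage version (integer type is an assumption)
    storage_id: str       # unique identifier of the data storage in the cluster

    def dumps(self) -> str:
        """Serialize to a hypothetical key=value text format."""
        return (
            f"nodeType={self.node_type}\n"
            f"storageVersion={self.storage_version}\n"
            f"storageID={self.storage_id}\n"
        )

    @classmethod
    def loads(cls, text: str) -> "VersionFile":
        """Parse the hypothetical key=value format produced by dumps()."""
        fields = dict(
            line.split("=", 1) for line in text.splitlines() if line.strip()
        )
        return cls(
            node_type=fields["nodeType"],
            storage_version=int(fields["storageVersion"]),
            storage_id=fields["storageID"],
        )
```

Keeping the parser independent of the node type is what would allow the same code to serve name-nodes and data-nodes.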
Then we will be able to accumulate data directories for different versions under the same root:
<dfs.data.dir[i]>
    data                // directory for current version
        version         // the version file
        …blocks…        // block files and subdirectories
    data-<
        version
        …blocks…
    data-<
        version
        …blocks…
The name-node upgrade is started by
bin/hadoop namenode -upgrade <oldFSSID>
During upgrade with the specified <oldFSSID> parameter the name-node
Data-nodes upgrade only if the name-node instructs them to upgrade.
During startup a data-node retrieves available FSSIDs and sends the list to the name-node along with all other information required for registration.
If the name-node is in upgrade mode and the list of FSSIDs available on the data-node does not contain the <oldFSSID> required for the upgrade, the name-node instructs the data-node to upgrade and passes <oldFSSID> to it.
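The registration-time decision described above reduces to a single predicate. The sketch below uses hypothetical names and plain Python values in place of the real registration messages.

```python
def should_instruct_upgrade(namenode_upgrading, old_fssid, datanode_fssids):
    """Return True if the name-node should tell this data-node to upgrade.

    namenode_upgrading: whether the name-node is in upgrade mode.
    old_fssid: the <oldFSSID> the upgrade was started with.
    datanode_fssids: FSSIDs the data-node reported during registration.
    """
    return namenode_upgrading and old_fssid not in datanode_fssids
```

A data-node that already holds the snapshot named by <oldFSSID> is therefore not asked to upgrade a second time.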
During upgrade the data-node
Previous snapshots of file system states can be removed by starting the name-node with the -discard option.
bin/upgrade-dfs -discard [<oldFSSID>]
The name-node obtains the list of supported version FSSIDs by examining the namespace directory root. During regular cluster operation the data-nodes send the versions they possess to the name-node, piggybacked on registration and block report requests. The supported versions are displayed in the web UI for each node.
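Examining a directory root for preserved states could be sketched as below. The directory naming is an assumption: the layout listings in this document are truncated after "data-<", so the sketch simply treats whatever follows the "data-" prefix as the identifier. The function name is hypothetical as well.

```python
from pathlib import Path

# Assumption: snapshot directories are named "data-" followed by an identifier.
SNAPSHOT_PREFIX = "data-"


def list_snapshot_ids(root):
    """Return the sorted identifiers of snapshot directories found under root."""
    return sorted(
        entry.name[len(SNAPSHOT_PREFIX):]
        for entry in Path(root).iterdir()
        if entry.is_dir() and entry.name.startswith(SNAPSHOT_PREFIX)
    )
```

The current-version directory, named plain "data", does not match the prefix and is therefore not reported as a snapshot.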
When the name-node starts with the -discard option it
When a data-node receives the discard command it
The rollback command reverses the upgrade, replacing the current version with the one specified.
bin/upgrade-dfs -rollback <oldFSSID>
To roll back, the name-node and data-nodes should be started with the -rollback option. Due to possible software failure or data inconsistency, the cluster might have become dysfunctional, so we try to prevent servers from reading persistent data from their current directories. In this case the cluster components should start working directly with the previous healthy state.
When the name-node starts with the -rollback option it
When a data-node starts with the -rollback option it
Taking into account the massive layout changes, and also that upgrades with rollback are hard to support for versions with different directory structures, I propose to keep automatic data conversion (as we have had until now) from the current version to the next one. Neither going back (rollback) nor forcing (upgrade) will be supported between the current and the next versions. Upgrades will be supported once the data is converted to the new format.
For detailed algorithms see the FSStateTransition.html document.