DFS Upgrade Proposal.

Rev. 3.

 

The purpose of the Hadoop DFS upgrade is to safely transition the DFS cluster to a new software version. First, the upgrade takes a snapshot of the previous state of the file system in order to protect it from errors in the new software or administrator mistakes during the upgrade process. Second, the old data is converted to the new format (if necessary), making it available to the new software.

Conversion is a part of the Hadoop source code; it will be performed automatically whether or not a snapshot of the old data is taken. Without a snapshot the conversion is irreversible.

Taking a snapshot is a choice made by the system administrator to prevent data loss. It gives the administrator an opportunity to reverse modifications made during or after the upgrade.

File System State Transition commands.

A transition of the file system state is an operation (upgrade, rollback, discard) that creates, restores, or removes file system states.

The software upgrade procedure includes the following steps:

  • Stop the DFS cluster.
  • Install the new version of Hadoop software.
  • Start the DFS cluster with the -upgrade option.
  • Optionally discard the old file system state if the upgrade is successful.
  • Optionally roll back the upgrade if it was not satisfactory.

The upgrade starts when the cluster is started with the -upgrade option.

bin/upgrade-dfs -upgrade <oldFSSID>

The -upgrade option means

  • that the current state of the file system will be preserved under the administrator-specified file system state id (FSSID) <oldFSSID>, which is used to refer to the state externally, and
  • that the old data will be converted to the new format if necessary and will become the current state of the file system.

If for any reason the upgrade procedure fails or the result is not satisfactory, the old data can be restored using the rollback command.

bin/upgrade-dfs -rollback <oldFSSID>

The rollback removes the current state and replaces it with the old file system state specified by <oldFSSID>.

If the old state is no longer needed it can be discarded using

bin/upgrade-dfs -discard <oldFSSID>

The discard command removes the specified old file system state.

For the cluster a transition consists of two major stages: the local state transition and the distributed stage. During the local stage servers (name- and data-nodes) perform local disk operations, that is, rename, create, or remove local files and directories. If an operation fails during the local stage the original local state of the directories can be restored using the recover command.

bin/upgrade-dfs -recover

For name-nodes the recovery is manual, since we would like to be able to recover namespace state without actually starting the cluster. Data-nodes recover automatically during the startup as in general they don’t know about transitions until the name-node requests one.
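The four transition commands above could be parsed along the following lines. This is only a sketch: the function name parse_transition and the returned (command, FSSID) tuple are assumptions made for illustration, and the options are written with a plain ASCII hyphen.

```python
import argparse
from typing import List, Optional, Tuple

FSSID_COMMANDS = ("upgrade", "rollback", "discard")


def parse_transition(argv: List[str]) -> Tuple[str, Optional[str]]:
    """Return (command, fssid) for one upgrade-dfs invocation.

    Hypothetical helper: exactly one transition option must be given.
    """
    parser = argparse.ArgumentParser(prog="upgrade-dfs")
    group = parser.add_mutually_exclusive_group(required=True)
    for name in FSSID_COMMANDS:
        # -upgrade, -rollback and -discard each take the FSSID as a value.
        group.add_argument(f"-{name}", metavar="oldFSSID")
    # -recover takes no argument.
    group.add_argument("-recover", action="store_true")
    args = parser.parse_args(argv)

    if args.recover:
        return ("recover", None)
    for name in FSSID_COMMANDS:
        value = getattr(args, name)
        if value is not None:
            return (name, value)
    raise AssertionError("argparse guarantees one option is present")
```

For example, parse_transition(["-rollback", "before-upgrade"]) yields ("rollback", "before-upgrade"), while combining two options makes argparse exit with a usage error.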

Versions vs. File System State.

HDFS supports several internal versions: the Hadoop build version (BV), RPC protocol versions, and the storage layout version (LV). All of them are independently maintained internally by Hadoop developers and reflect changes made to the software.

FSSID is not a software version. It reflects a particular file system state. FSSID is a name for a file system data snapshot, assigned to the snapshot and maintained by the cluster administrator. DFS reports available FSSIDs via the web UI and uses them as references to the preserved file system states.

FSSID is introduced in order to let the administrator create a snapshot whenever desired, independently of other version changes, even without any changes to the software at all.

Mandatory upgrade.

The upgrade is mandatory if the new software increments the LV number, which indicates that the data format has changed and a conversion should be done. Otherwise, the upgrade is optional, although it is recommended to upgrade every time new software is installed. If the upgrade is mandatory and a node is started without the -upgrade option, the server should fail and report that it needs to be upgraded.
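The startup rule above can be sketched as a small check. The names check_startup and UpgradeRequiredError are hypothetical, and the sketch assumes LV is an integer that grows whenever the data format changes, as "increments" suggests. The case of software older than the stored data is not covered by this proposal; the sketch simply rejects it.

```python
class UpgradeRequiredError(RuntimeError):
    """Raised when a node must be started with -upgrade but was not."""


def check_startup(stored_lv: int, software_lv: int,
                  upgrade_requested: bool) -> str:
    """Return "upgrade" or "regular" for a starting node.

    Assumption: a larger LV means a newer storage layout.
    """
    if software_lv < stored_lv:
        # Not described in the text; refusing is the cautious choice here.
        raise ValueError("software is older than the stored data layout")
    if software_lv > stored_lv and not upgrade_requested:
        # The layout changed, so a conversion is needed: mandatory upgrade.
        raise UpgradeRequiredError(
            f"layout {stored_lv} must be upgraded to {software_lv}; "
            "restart with -upgrade"
        )
    # With an unchanged layout the upgrade is optional (snapshot only).
    return "upgrade" if upgrade_requested else "regular"
```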

Namespace directory structure.

The namespace directory root is defined by the <dfs.name.dir> configuration parameter. The current directory structure is:

<dfs.name.dir>

      image             // namespace directory

                        fsimage     // file containing latest checkpoint

                        time        // file containing latest checkpoint timestamp

      edits // file containing namespace edits since last checkpoint

I propose the following changes to the file layout:

  • move edits file inside the image directory;
  • change name of the image directory to data;
  • add a new file version, which contains the following fields:
    • a string “NameNode”;
    • data storage version LV;
    • namespaceID, which identifies the file system instance;
    • timestamp of the latest checkpoint;
  • change the fsimage format so that it no longer contains LV and namespaceID, which are now in version;
  • remove the file time, migrating its contents to version.

The version file will be a common part of the persistent data layout for name-nodes and data-nodes. With that approach we will be able to reuse code related to versioning, identification, and locking.
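As an illustration of the proposed version file, here is a minimal sketch of writing and reading its four name-node fields. The proposal does not fix an on-disk encoding; the key=value text format, the key names, and the helper names below are all assumptions.

```python
from dataclasses import dataclass


@dataclass
class NameNodeVersion:
    node_type: str        # the string "NameNode"
    layout_version: int   # data storage version LV
    namespace_id: int     # identifies the file system instance
    checkpoint_time: int  # timestamp of the latest checkpoint


def format_version(v: NameNodeVersion) -> str:
    """Serialize the fields as key=value lines (assumed format)."""
    return (
        f"nodeType={v.node_type}\n"
        f"layoutVersion={v.layout_version}\n"
        f"namespaceID={v.namespace_id}\n"
        f"checkpointTime={v.checkpoint_time}\n"
    )


def parse_version(text: str) -> NameNodeVersion:
    """Parse the text produced by format_version."""
    fields = dict(line.split("=", 1) for line in text.splitlines() if line)
    return NameNodeVersion(
        node_type=fields["nodeType"],
        layout_version=int(fields["layoutVersion"]),
        namespace_id=int(fields["namespaceID"]),
        checkpoint_time=int(fields["checkpointTime"]),
    )
```

A data-node variant would carry "DataNode" and a storageID instead of the namespaceID and checkpoint timestamp, which is where the shared code would pay off.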

With the changes above we will be able to accumulate namespace directories for different versions under the same root:

<dfs.name.dir>

      data              // directory for current version

            version     // the version file

            fsimage     // file containing latest checkpoint

            edits       // file containing namespace edits

      data-<LV>-<FSSID1> // backup directory 1

            version

            fsimage

            edits

      data-<LV>-<FSSID2> // backup directory 2

            version

            fsimage

            edits
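With this layout the preserved states can be found by looking at directory names alone. The sketch below extracts (LV, FSSID) pairs from the entries of a storage root. The function name is hypothetical, and the pattern assumes LV is an integer (possibly negative) and that the FSSID is everything after the second separator.

```python
import re
from typing import Iterable, List, Tuple

# Matches backup directories named data-<LV>-<FSSID>.
BACKUP_DIR = re.compile(r"^data-(-?\d+)-(.+)$")


def available_states(entries: Iterable[str]) -> List[Tuple[int, str]]:
    """Return sorted (LV, FSSID) pairs for all backup directory names."""
    states = []
    for name in entries:
        match = BACKUP_DIR.match(name)
        if match:
            states.append((int(match.group(1)), match.group(2)))
    return sorted(states)
```

It could be called as available_states(os.listdir(root)) for a namespace or data-node root; the current data directory and any in-progress directories are ignored because they do not match the pattern.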

Data-node directory structure.

The data-node root directories are defined by the <dfs.data.dir> configuration parameter, which should be verified to be different from <dfs.name.dir> at startup.

Data-node blocks can be stored in multiple directories, typically located on different disk volumes. The structure of each directory is:

<dfs.data.dir[i]>

      data        // directory containing blocks

            …blocks…    // block files and subdirectories

      storage     // file containing version and storage id

I propose to replace the storage file with the version file and to place the latter inside the data directory. The fields of the version file in this case are:

  • a string “DataNode”;
  • data storage version LV;
  • storageID, a unique identifier of the data storage within the cluster.

Then we will be able to accumulate data directories for different versions under the same root:

<dfs.data.dir[i]>

      data              // directory for current version

            version     // the version file

            …blocks…    // block files and subdirectories

      data-<LV>-<FSSID1> // backup directory 1

            version

            …blocks…

      data-<LV>-<FSSID2> // backup directory 2

            version

            …blocks…

The name-node upgrade.

The name-node upgrade is started by

bin/hadoop namenode -upgrade <oldFSSID>

During upgrade with the specified <oldFSSID> parameter the name-node

  1. Starts in safe mode and reads the latest checkpoint, converting it in memory to the new format. The old checkpoint remains untouched in the namespace directory.
  2. Renames the current namespace directory data to upgrade-<LV>-<oldFSSID>.
  3. Checkpoints the current in-memory namespace image to the re-created namespace directory data. It first saves the namespace image into fsimage and then writes the version file as an indication that the name-node upgrade has been successfully completed. The version file should exist in the data directory and be locked at all times to prevent accidental startup of other name-nodes in the same directory.
  4. If at startup the name-node finds a directory named upgrade-*, it knows the upgrade has not been completed and proceeds with the upgrade.
  5. Data-nodes notify the name-node of the FSSIDs they possess (piggybacked on registration and block-report requests).
  6. If the list does not contain <oldFSSID>, the name-node asks the data-node to upgrade. Registration and block-report requests from a data-node are ignored by the name-node until that data-node confirms the upgrade.
  7. The upgraded data-nodes start sending block reports. When the safe mode conditions are satisfied the cluster upgrade is considered complete.
  8. The name-node renames directory upgrade-<LV>-<oldFSSID> to data-<LV>-<oldFSSID>.
  9. Data-nodes that failed to upgrade are explicitly logged and reported by the name-node.
  10. The name-node UI reflects upgraded data-nodes and lists available FSSIDs per node.
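The local disk operations in steps 2, 3, 4, and 8 above can be sketched as follows. The function names are hypothetical, the image and version contents are passed in as opaque values, and file locking and the distributed steps are left out.

```python
from pathlib import Path


def upgrade_in_progress(root: Path) -> bool:
    """Step 4: an upgrade-* directory means the upgrade did not finish."""
    return any(p.is_dir() for p in root.glob("upgrade-*"))


def namenode_local_upgrade(root: Path, lv: int, old_fssid: str,
                           image: bytes, version_text: str) -> None:
    """Steps 2 and 3: set the old state aside, then checkpoint the new one."""
    current = root / "data"
    # Step 2: the old namespace directory becomes the in-progress snapshot.
    current.rename(root / f"upgrade-{lv}-{old_fssid}")
    # Step 3: re-create data, save the image first and the version file
    # last, so the version file marks a completed name-node upgrade.
    current.mkdir()
    (current / "fsimage").write_bytes(image)
    (current / "version").write_text(version_text)


def namenode_finish_upgrade(root: Path, lv: int, old_fssid: str) -> None:
    """Step 8: give the snapshot its final name once the cluster is done."""
    (root / f"upgrade-{lv}-{old_fssid}").rename(
        root / f"data-{lv}-{old_fssid}"
    )
```

Writing the version file last is what makes the sequence restartable: a crash before that point leaves an upgrade-* directory and a data directory without a version file.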

Data-node upgrade.

Data-nodes upgrade only if the name-node instructs them to upgrade.

During startup a data-node retrieves available FSSIDs and sends the list to the name-node along with all other information required for registration.

If the name-node is in the upgrade mode and the list of FSSIDs available on the data-node does not contain the <oldFSSID> required for the upgrade, the name-node instructs the data-node to upgrade and passes <oldFSSID> to it.
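The name-node side of this exchange reduces to a membership test. In the sketch below the function name and the reply strings "UPGRADE ..." and "REGISTERED" are invented for illustration; the actual protocol messages are not defined in this proposal.

```python
from typing import Iterable, Optional


def registration_reply(upgrade_fssid: Optional[str],
                       reported_fssids: Iterable[str]) -> str:
    """Decide how the name-node answers a registering data-node.

    upgrade_fssid is None when the name-node is not in upgrade mode.
    """
    if upgrade_fssid is not None and upgrade_fssid not in set(reported_fssids):
        # The data-node has no snapshot under this FSSID yet: ask it to
        # upgrade and pass the FSSID along.
        return "UPGRADE " + upgrade_fssid
    # Either no upgrade is running or the data-node already upgraded.
    return "REGISTERED"
```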

During upgrade the data-node

  1. Renames the current data directory data to upgrade-<LV>-<oldFSSID>, thus creating a snapshot of old meta- and block-data, which will remain unchanged in this directory until the discard command is issued.
  2. Re-creates directory data with an empty version file in it.
  3. Hard links block files to their old counterparts. Hard linking preserves old blocks from physical removal. Removing a block will just decrement the link count for the block file and remove an entry from the new data directory, but the block will still remain accessible from the old data directory. New blocks will be created in the new data directory. Currently we never modify data in the blocks. If we ever do, we will need to duplicate linked files before modifying them.
  4. Writes the actual field values into the version file, indicating that the upgrade is finished. If at startup the data-node finds an empty version file it knows that the upgrade is not complete. It finds the directory upgrade-* and proceeds with the upgrade.
  5. Renames directory upgrade-<LV>-<oldFSSID> to data-<LV>-<oldFSSID>.
  6. Confirms the completion of the upgrade with the name-node by sending the registration request again, this time with the list of available FSSIDs containing <oldFSSID>. The name-node is supposed to accept the registration and mark the data-node as upgraded.
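Steps 1 through 5 above could look roughly like this on a POSIX file system. The function name is hypothetical, the version contents are passed in as opaque text, and resuming an interrupted upgrade is not handled.

```python
import os
from pathlib import Path


def datanode_upgrade(root: Path, lv: int, old_fssid: str,
                     version_text: str) -> None:
    """Snapshot the data directory and hard link all blocks into a new one."""
    current = root / "data"
    in_progress = root / f"upgrade-{lv}-{old_fssid}"

    current.rename(in_progress)        # step 1: old data becomes the snapshot
    current.mkdir()                    # step 2: re-create data ...
    (current / "version").touch()      # ... with an empty version file

    # Step 3: link every block file; only directory entries are created,
    # the block contents are shared with the snapshot.
    for src in in_progress.rglob("*"):
        rel = src.relative_to(in_progress)
        if rel == Path("version"):
            continue                   # the old version file stays behind
        dst = current / rel
        if src.is_dir():
            dst.mkdir(parents=True, exist_ok=True)
        else:
            dst.parent.mkdir(parents=True, exist_ok=True)
            os.link(src, dst)

    # Step 4: a non-empty version file marks the upgrade as finished.
    (current / "version").write_text(version_text)
    # Step 5: give the snapshot its final name.
    in_progress.rename(root / f"data-{lv}-{old_fssid}")
```

Because both directory entries point at the same file, deleting a block from data afterwards leaves it readable under data-<LV>-<oldFSSID>, which is the property the text relies on.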

The Discard Version Command.

Previous snapshots of file system states can be removed by starting the name-node with the -discard option.

bin/upgrade-dfs -discard [<oldFSSID>]

The name-node obtains the list of available FSSIDs by examining the namespace directory root. During regular cluster operation the data-nodes send the FSSIDs they possess to the name-node, piggybacked on registration and block-report requests. The available FSSIDs are displayed in the web UI for each node.

When the name-node starts with the -discard option it

  1. Starts in safe mode and reads the state from the current directory.
  2. Renames directory data-<LV>-<FSSID> to discard-<LV>-<FSSID>.
  3. When the name-node receives registrations and block reports it checks whether the data-node has the state being discarded, and replies with the discard command if it does.

When a data-node receives the discard command it

  1. Renames the directory data-<LV>-<oldFSSID> corresponding to the state to be discarded to discard-<LV>-<oldFSSID>.
  2. Sends confirmation to the name-node that the state is removed.
  3. Starts a new thread, which will remove the local data of the old state.
  4. At startup each data-node checks its available states and removes directories named discard-*.
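The data-node side of the discard can be sketched as a rename followed by a background removal, plus a startup sweep for leftovers. The function names are hypothetical, and the confirmation message to the name-node is only marked by a comment.

```python
import shutil
import threading
from pathlib import Path


def discard_state(root: Path, lv: int, fssid: str) -> threading.Thread:
    """Steps 1 to 3: rename the snapshot, then delete it in the background."""
    doomed = root / f"discard-{lv}-{fssid}"
    # Step 1: the rename makes the state unavailable right away.
    (root / f"data-{lv}-{fssid}").rename(doomed)
    # Step 2 would send the confirmation to the name-node here.
    # Step 3: the possibly slow removal runs in its own thread.
    worker = threading.Thread(
        target=shutil.rmtree, args=(doomed,), kwargs={"ignore_errors": True}
    )
    worker.start()
    return worker


def cleanup_discarded(root: Path) -> None:
    """Step 4: at startup, finish any removal that was interrupted."""
    for leftover in list(root.glob("discard-*")):
        if leftover.is_dir():
            shutil.rmtree(leftover)
```

Renaming before deleting means a crash in the middle of the removal leaves only a discard-* directory, which the startup sweep cleans up.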

The Rollback Command.

The rollback command reverses the upgrade, replacing the current state with the one specified.

bin/upgrade-dfs -rollback <oldFSSID>

In order to roll back, the name-node and data-nodes should be started with the -rollback option. Due to possible software failure or inconsistency of data the cluster might have become dysfunctional, so we try to prevent servers from reading persistent data from their current directories. Instead, the cluster components should start working directly with the previous healthy state.

When the name-node starts with the -rollback option it

  1. Renames directory data (if it exists) to rollback-<LV>-<oldFSSID>.
  2. Renames directory data-<LV>-<oldFSSID> to data.
  3. Starts in safe mode and verifies that all data-nodes confirmed during registration that they rolled back to the appropriate state. Regular registrations are ignored during the rollback stage.
  4. The confirmed data-nodes start sending block reports. When the safe mode conditions are satisfied the rollback is complete.
  5. Removes directory rollback-<LV>-<oldFSSID>.
  6. Data-nodes that failed to roll back are explicitly logged and reported by the name-node.

When a data-node starts with the -rollback option it

  1. Discards the contents of the current data directory (if it exists).
  2. Renames directory data-<LV>-<oldFSSID> to data.
  3. Starts with the restored data directory and sends the rollback completion confirmation to the name-node along with the registration request. The name-node is supposed to accept the registration and mark the data-node as rolled back.
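The local directory operations of both rollback procedures are sketched below. The function names are hypothetical. Checking that the snapshot exists before touching the current directory is an extra precaution added in this sketch, not something the steps above require.

```python
import shutil
from pathlib import Path


def _snapshot_dir(root: Path, lv: int, old_fssid: str) -> Path:
    """Locate data-<LV>-<oldFSSID>, failing early if it is missing."""
    snapshot = root / f"data-{lv}-{old_fssid}"
    if not snapshot.is_dir():
        raise FileNotFoundError(f"no preserved state {snapshot.name}")
    return snapshot


def namenode_rollback_local(root: Path, lv: int, old_fssid: str) -> Path:
    """Name-node steps 1 and 2; returns the directory to remove in step 5."""
    snapshot = _snapshot_dir(root, lv, old_fssid)
    current = root / "data"
    set_aside = root / f"rollback-{lv}-{old_fssid}"
    if current.exists():
        current.rename(set_aside)      # step 1: keep it until rollback ends
    snapshot.rename(current)           # step 2: old state becomes current
    return set_aside


def datanode_rollback_local(root: Path, lv: int, old_fssid: str) -> None:
    """Data-node steps 1 and 2."""
    snapshot = _snapshot_dir(root, lv, old_fssid)
    current = root / "data"
    if current.exists():
        shutil.rmtree(current)         # step 1: discard the current data
    snapshot.rename(current)           # step 2: old state becomes current
```

The name-node keeps its previous directory under rollback-* until the cluster-wide rollback is confirmed, whereas the data-node drops its current blocks immediately; the snapshot's hard-linked blocks are unaffected by that removal.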

Note.

Taking into account the massive layout changes, and also that upgrades with rollback are hard to support for versions with different directory structures, I propose to keep automatic data conversion (as we had until now) from the current version to the next one. Neither rollback nor a forced upgrade will be supported between the current version and the next one. Upgrades will be supported once the data is converted to the new format.

 

For detailed algorithms see the FSStateTransition.html document.