Hadoop Common / HADOOP-306

Safe mode and name node startup procedures

    Details

    • Type: New Feature
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 0.3.2
    • Fix Version/s: 0.7.0
    • Component/s: None
    • Labels:
      None

      Description

      This is a proposal to improve the DFS cluster startup process.
      The data node startup procedures were described and implemented in HADOOP-124.
      This issue extends them to the name node.
      The main idea is to introduce a safe mode, which can be entered in three ways:
      manually, for administration purposes; automatically, when the number of active
      data nodes falls below a configurable threshold; or at startup, when the name node
      stays in safe mode until the minimum number of active nodes is reached.

      These are high-level requirements intended to improve name node and cluster reliability.
      = In safe mode the name node does not change the state of the file system:
      metadata is read-only, and block replication and removal do not take place.
      = In safe mode the name node accepts data node registrations and
      processes their block reports.
      = The name node always starts in safe mode and stays in safe mode until the majority
      (a configurable parameter: safemode.threshold) of data nodes (or blocks?)
      has reported.
      = The name node can also fall into safe mode when the number of non-active
      data nodes (those whose heartbeats have stopped arriving) becomes critical.
      = The startup "silent period", during which the name node is in safe mode and
      does not issue any block requests to the data nodes, is initially set to a
      configurable value, safemode.timeout.increment. At the end of the timeout
      the name node checks safemode.threshold and decides whether to switch
      to normal mode or to stay in safe mode. If the normal mode criteria are not
      met, the silent period is extended by incrementing the safe mode timeout.
      = The name node stays in safe mode no longer than a configurable value,
      safemode.timeout.max. If that limit is reached, it logs the missing data nodes
      and shuts itself down.
      = When the name node switches to normal mode it checks whether all required
      data nodes have actually registered, based on the list of active data storages
      from the last session. Then it logs missing nodes, if any, and starts
      replicating and/or deleting blocks as required.
      = A historical list of data storages (nodes) ever registered with the cluster is
      persistently stored in the image and log files. The list is used in two ways:
      a) at startup to verify whether all nodes have registered, and to report
      missing nodes;
      b) at runtime, if a data node registers with a new storage id, the
      name node verifies that no new blocks are reported from that storage.
      This prevents data nodes from a different cluster from being connected
      accidentally.
      = The name node should have an option to run in safe mode. Starting with
      that option would mean it never leaves safe mode.
      This is useful for testing the cluster.
      = Data nodes that cannot connect to the name node for a long (configurable) time
      should shut themselves down.
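
      The silent-period requirements above can be sketched roughly as follows. This is
      only an illustration of the proposed decision loop, not code from the attached
      patches: the class, enum, interface, and method names are hypothetical, the three
      parameters stand in for safemode.threshold, safemode.timeout.increment, and
      safemode.timeout.max, and the clock is simulated rather than waited on.

```java
// Hypothetical sketch of the startup "silent period" loop described in this issue.
// None of these names come from the actual patch; they are assumptions for illustration.
public class SafeModeStartupSketch {

    /** Possible results of the startup silent period. */
    public enum Outcome { LEAVE_SAFE_MODE, SHUT_DOWN }

    /** Supplies the fraction (0.0 to 1.0) of data nodes (or blocks) reported so far. */
    public interface ReportedFraction {
        double at(long elapsedMillis);
    }

    /**
     * Simulates the silent period without sleeping.
     *
     * @param threshold       stands in for safemode.threshold
     * @param incrementMillis stands in for safemode.timeout.increment
     * @param maxMillis       stands in for safemode.timeout.max
     * @param reported        reported fraction as a function of elapsed time
     */
    public static Outcome runSilentPeriod(double threshold,
                                          long incrementMillis,
                                          long maxMillis,
                                          ReportedFraction reported) {
        if (incrementMillis <= 0 || maxMillis <= 0) {
            throw new IllegalArgumentException("timeouts must be positive");
        }
        // The initial timeout is one increment, capped at the maximum.
        long timeout = Math.min(incrementMillis, maxMillis);
        while (true) {
            // A real name node would wait here until the timeout expires.
            if (reported.at(timeout) >= threshold) {
                return Outcome.LEAVE_SAFE_MODE;   // normal mode criteria met
            }
            if (timeout >= maxMillis) {
                return Outcome.SHUT_DOWN;         // maximum reached: log missing nodes, shut down
            }
            // Criteria not met: extend the silent period by one increment.
            timeout = Math.min(timeout + incrementMillis, maxMillis);
        }
    }
}
```

      For example, with a threshold of 0.9, an increment of 1000 ms, and a maximum of
      5000 ms, a cluster whose reported fraction reaches 0.95 after 3000 ms leaves safe
      mode on the third check, while one that never exceeds 0.5 ends in shutdown.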

      1. SafeMode.patch
        40 kB
        Konstantin Shvachko
      2. SafeModeEnum.patch
        42 kB
        Konstantin Shvachko
      3. SafeMode.html
        19 kB
        Ravi Phulari

          Activity

          Konstantin Shvachko created issue -
          Doug Cutting made changes -
          Field Original Value New Value
          Fix Version/s 0.4.0 [ 12311021 ]
          Fix Version/s 0.5.0 [ 12311939 ]
          Konstantin Shvachko made changes -
          Assignee Konstantin Shvachko [ shv ]
          Doug Cutting made changes -
          Workflow no-reopen-closed [ 12373982 ] no-reopen-closed, patch-avail [ 12377499 ]
          Doug Cutting made changes -
          Fix Version/s 0.6.0 [ 12312025 ]
          Fix Version/s 0.5.0 [ 12311939 ]
          Konstantin Shvachko made changes -
          Attachment FSImageSaveDNInfo.patch [ 12338731 ]
          Konstantin Shvachko made changes -
          Status Open [ 1 ] Patch Available [ 10002 ]
          Doug Cutting made changes -
          Status Patch Available [ 10002 ] Open [ 1 ]
          Konstantin Shvachko made changes -
          Link This issue incorporates HADOOP-456 [ HADOOP-456 ]
          Konstantin Shvachko made changes -
          Attachment FSImageSaveDNInfo.patch [ 12338731 ]
          Yoram Arnon made changes -
          Comment [ du is inexpensive. See time comparisons of ls and du on the root of a DFS containing 4500 directories and 250000 files.
          'top' on the namenode showed no discernible difference.

          lsr is a different story, see timing at the bottom.

          >time hadoop dfs -du /

          real 0m2.217s
          user 0m0.473s
          sys 0m0.100s

          >time hadoop dfs -ls /

          real 0m2.036s
          user 0m0.469s
          sys 0m0.096s


          >time hadoop dfs -lsr /

          real 0m55.100s
          user 0m25.186s
          sys 0m4.105s ]
          Doug Cutting made changes -
          Fix Version/s 0.6.0 [ 12312025 ]
          Konstantin Shvachko made changes -
          Attachment SafeMode.patch [ 12340888 ]
          Konstantin Shvachko made changes -
          Status Open [ 1 ] Patch Available [ 10002 ]
          Konstantin Shvachko made changes -
          Link This issue incorporates HADOOP-430 [ HADOOP-430 ]
          Doug Cutting made changes -
          Status Patch Available [ 10002 ] Open [ 1 ]
          Konstantin Shvachko made changes -
          Attachment SafeModeEnum.patch [ 12341182 ]
          Konstantin Shvachko made changes -
          Status Open [ 1 ] Patch Available [ 10002 ]
          Doug Cutting made changes -
          Fix Version/s 0.7.0 [ 12312051 ]
          Resolution Fixed [ 1 ]
          Status Patch Available [ 10002 ] Resolved [ 5 ]
          Doug Cutting made changes -
          Status Resolved [ 5 ] Closed [ 6 ]
          Ravi Phulari made changes -
          Attachment SafeMode.html [ 12422025 ]

            People

            • Assignee:
              Konstantin Shvachko
              Reporter:
              Konstantin Shvachko
            • Votes:
              0
              Watchers:
              1
