Hadoop Common / HADOOP-306

Safe mode and name node startup procedures

    Details

    • Type: New Feature
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 0.3.2
    • Fix Version/s: 0.7.0
    • Component/s: None
    • Labels:
      None

      Description

      This is a proposal to improve DFS cluster startup process.
      The data node startup procedures were described and implemented in HADOOP-124.
      I'm trying to extend them to the name node here.
      The main idea is to introduce a safe mode, which can be entered manually for administration
      purposes, or automatically when the number of active data nodes falls below a configurable
      threshold; at startup, the node stays in safe mode until the minimal limit of active
      nodes is reached.

      These are high-level requirements intended to improve the name node and cluster reliability.
      = The name node safe mode means that the name node is not changing the state of the
      file system. Meta data is read-only, and block replication / removal is not taking place.
      = In safe mode the name node accepts data node registrations and
      processes their block reports.
      = The name node always starts in safe mode and stays safe until the majority
      (a configurable parameter: safemode.threshold) of data nodes (or blocks?)
      is reported.
      = The name node can also fall into safe mode when the number of non-active
      (heartbeats stopped coming in) data nodes becomes critical.
      = The startup "silent period", when the name node is in safe mode and is
      not issuing any block requests to the data nodes, is initially set to a
      configurable value safemode.timeout.increment. By the end of the timeout
      the name node checks the safemode.threshold and decides whether to switch
      to the normal mode or to stay in safe mode. If the normal mode criteria are not
      met, then the silent period is extended by incrementing the safemode timeout.
      = The name node stays in safe mode no longer than a configurable value of
      safemode.timeout.max, in which case it logs missing data nodes and shuts
      itself down.
      = When the name node switches to normal mode it checks whether all required
      data nodes have actually registered, based on the list of active data storages
      from the last session. Then it logs missing nodes, if any, and starts
      replicating and/or deleting blocks as required.
      = A historical list of data storages (nodes) ever registered with the cluster is
      persistently stored in the image and log files. The list is used in two ways:
      a) at startup to verify whether all nodes have registered, and to report
      missing nodes;
      b) at runtime if a data node registers with a new storage id the
      name node verifies that no new blocks are reported from that storage,
      which would prevent us from accidentally connecting data nodes from a
      different cluster.
      = The name node should have an option to run in safe mode. Starting with
      that option would mean it never leaves safe mode.
      This is useful for testing the cluster.
      = Data nodes that cannot connect to the name node for a long time (configurable)
      should shut themselves down.
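The startup logic above (start in safe mode, check the threshold when the silent period expires, extend the period by safemode.timeout.increment, and give up at safemode.timeout.max) can be sketched roughly as follows. This is only an illustrative sketch: the class and method names are invented here, and only the three safemode.* parameters come from the proposal.

```java
// Illustrative sketch of the proposed silent-period logic; class and
// method names are hypothetical, only the safemode.* parameters are
// taken from the proposal above.
public class SafeModeSketch {
    final double threshold;       // safemode.threshold: fraction of nodes that must report
    final long timeoutIncrement;  // safemode.timeout.increment, ms
    final long timeoutMax;        // safemode.timeout.max, ms
    long silentPeriod;            // current length of the silent period

    SafeModeSketch(double threshold, long increment, long max) {
        this.threshold = threshold;
        this.timeoutIncrement = increment;
        this.timeoutMax = max;
        this.silentPeriod = increment;  // initially one increment long
    }

    /** Decision taken when the current silent period expires. */
    String onTimeout(int reportedNodes, int expectedNodes) {
        if ((double) reportedNodes / expectedNodes >= threshold)
            return "LEAVE_SAFE_MODE";      // criteria met: switch to normal mode
        if (silentPeriod + timeoutIncrement > timeoutMax)
            return "SHUTDOWN";             // log missing nodes and shut down
        silentPeriod += timeoutIncrement;  // extend the silent period and wait
        return "STAY_IN_SAFE_MODE";
    }

    public static void main(String[] args) {
        SafeModeSketch s = new SafeModeSketch(0.9, 30000, 60000);
        System.out.println(s.onTimeout(8, 10));  // 80% reported: stay and extend
        System.out.println(s.onTimeout(9, 10));  // 90% reported: leave safe mode
    }
}
```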

      1. SafeMode.html
        19 kB
        Ravi Phulari
      2. SafeModeEnum.patch
        42 kB
        Konstantin Shvachko
      3. SafeMode.patch
        40 kB
        Konstantin Shvachko

        Issue Links

          Activity

          Konstantin Shvachko added a comment -

          This patch implements a part of the laid out design.
          It stores the historical list of datanodes in the image file and logs newly
          registered nodes in the edits file.
          The changes are mostly related to the FSNamesystem class.
          Since datanodes are not removed from the datanodeMap when they are considered
          non-responsive, the deadDatanodeMap field becomes redundant, so I removed it.
          There is a change in the semantics of the relation between the heartbeats and datanodeMap maps:
          heartbeats contains only live nodes, while datanodeMap contains both live and dead nodes.
          See JavaDoc for more details.
          So when we are looking for new targets for block replication we should check the heartbeats
          map rather than the datanodeMap as we did before.
          Also since the DatanodeDescriptors are not physically removed from datanodeMap
          I had to add their blocks cleanup while processing lost heartbeats.
          There are also some changes to the FSImage and FSEditLog classes. I removed the unnecessary
          FSDirectory parameter from the previous version; the whole name space can be
          accessed via the static method FSNamesystem.getFSNamesystem().

          Doug Cutting added a comment -

          Konstantin, since this patch only partially fixes this issue, can you please instead add this patch to a separate issue that this issue depends on? Thanks!

          Konstantin Shvachko added a comment -

          I'd like to discuss the meaning of the safemode.threshold parameter, which defines
          when the name node can leave safe mode and start replication/deletion of blocks.
          I see at least 2 meanings:
          1) % of storage ids reported so far. Storage id is a unique id of a data node
          that identifies the storage rather than the data node address.
          2) % of blocks reported so far by all data nodes.
          They seem to be fairly independent: even if we have 100% of storage ids
          reported, that does not mean we have 100% of blocks.
          And what does 100% mean for blocks? Does it mean the name node has
          2a) at least one copy of each block, or
          2b) the complete list of replicas for each block?
          All three constraints seem reasonable.
          100% in (1) is the best one could physically do, since all storages are
          reported; it is just that they may have corrupted or missing blocks.
          100% in (2a) is the minimal requirement for the name node to start
          without any data loss, and
          100% in (2b) would be the perfect state of the system.
          So the question is what do we really want?

          Yoram Arnon added a comment -

          On one hand, I'd want the namenode to be able to move out of safe mode with minimal restrictions, to avoid unnecessary manual intervention.
          On the other hand, I want to avoid unnecessary thrash on startup, while data nodes are connecting.
          So any solution would require both a threshold and a timer that starts once the threshold is reached, to avoid unnecessary replications the second the threshold is reached, before the last datanodes connect.

          I'd go with a configuration requiring:

          • 100% blocks are available, at least one replica
          • allow one more minute for stragglers to connect and report their blocks

          Typically this would allow up to two nodes to be missing. But if by some magic more nodes are missing and I still have all my data intact, such as when we implement rack-aware data placement and a rack goes down, hallelujah: I wait one minute and start re-replicating.

          Sameer Paranjpye added a comment -

          I wonder if it makes sense to wait for 100% of blocks to be available. It's possible for data to go missing because of failed drives, nodes, racks... When some data goes missing, it's usually the case that some other (co-located) data becomes under-replicated, and the Namenode ought to start replicating the under-replicated data. Why do we want safe mode?

          • To prevent replication thrash when the Namenode starts
          • To enable administrators to make the file system read only for diagnosis and debugging

          Neither of these require that 100% of blocks are present. Maybe we should have a slightly lower threshold for blocks or storage ids.

          Konstantin Shvachko added a comment -

          Thanks for the comments. Based on what you say I can define the threshold constraint as:

            • the percentage of blocks that meet the minimal replication requirement (defined by dfs.safemode.replication)

          I think this would be flexible enough.
          And I'll keep the dfs.datanode.startupMsec that defines the startup period during which replications are not allowed.
          So the name node will stay in safe mode at startup until the threshold is reached AND the startup period has expired.
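The threshold constraint above can be sketched as a simple check over the per-block replica counts reported so far. The class and method names below are hypothetical; only dfs.safemode.replication comes from the comment.

```java
// Hedged sketch of the threshold constraint described above: the share of
// blocks whose reported replica count meets the minimal replication
// (dfs.safemode.replication). Class and method names are hypothetical.
public class ThresholdSketch {
    /** True if enough blocks meet the minimal replication requirement. */
    static boolean thresholdReached(int[] replicasPerBlock,
                                    int minReplication,
                                    double threshold) {
        int safeBlocks = 0;
        for (int replicas : replicasPerBlock)
            if (replicas >= minReplication)
                safeBlocks++;
        return (double) safeBlocks / replicasPerBlock.length >= threshold;
    }

    public static void main(String[] args) {
        int[] replicas = {3, 2, 1, 0, 3};  // replicas reported so far, per block
        // 4 of 5 blocks have at least one replica -> ratio 0.8
        System.out.println(thresholdReached(replicas, 1, 0.8));   // true
        System.out.println(thresholdReached(replicas, 1, 0.95));  // false
    }
}
```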
          Yoram Arnon added a comment -

          The timeout should start after the threshold is reached.
          If the namenode is started long before the datanodes, then superfluous replications will occur as soon as the threshold is reached.
          The timeout should be short, I'd say less than a minute, and should start as soon as the data/nodes threshold is reached, to allow stragglers to connect and report.

          Konstantin Shvachko added a comment -

          This is the safe mode implementation; it implements everything that was discussed here
          except for automatically entering safe mode when the number of non-active blocks falls below
          the threshold. It seems arguable whether the name node should fall into safe mode in that case
          rather than replicating as quickly as it can.
          Please see the JavaDoc for a complete description of the safe mode concept and its implementation.

          Other minor changes

          • Block report interval is configurable, but there was no value for it in hadoop-default.xml. I added it.
            <name>dfs.blockreport.intervalMsec</name>
          • A typo in hadoop-default.xml
          • There was a memory leak related to the fact that the name node never removed blocks from the blocksMap,
            so it could grow large if the system ran long enough. I fixed this unreported bug.
          • I moved the startup of http servers for both data and name nodes to the end of their constructors
            as discussed in HADOOP-430
          • The ClientProtocol version is changed since the setSafeMode() method is added
          Doug Cutting added a comment -

          The enum mode should not be passed & returned as an int, but rather as a boolean or enum.

          As a boolean this might look like:

          boolean setMode(boolean isSafe);
          boolean getMode();

          To pass an enum over an RPC would require either changes to ObjectWritable or to make the enum class implement Writable.

          Konstantin Shvachko added a comment -

          Making an enum implement Writable does not work, since you cannot implement
          readFields() so that it would modify the fields (which are all final) in the enum and change its state.

          I made a simple modification to ObjectWritable that treats an enumerator type as a
          special case, like String or Array. This works fine for all simple (traditional) enum types.

          If we ever want to use internal fields and other "fancy" enum features, which are
          considered confusing for most programmers anyway, then we will need to
          introduce parameters to WritableFactory.newInstance() that could be used to
          construct a correct instance of the enum.

          In addition to what it had before, this patch introduces an enum type SafeModeAction
          and new functionality that lets RPC pass simple enumerator types.
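The special-casing of simple enums could look roughly like the sketch below: write the enum's declaring class name and constant name, and rebuild the constant with Enum.valueOf on the receiving side, so no readFields() is ever needed. This only illustrates the idea; the actual ObjectWritable change in the patch may serialize enums differently.

```java
import java.io.*;

// Hedged sketch of serializing a "simple" enum over the wire by name,
// in the spirit of the ObjectWritable special case described above.
// The actual patch may differ; this only illustrates the approach.
public class EnumOverWire {
    enum SafeModeAction { SAFEMODE_LEAVE, SAFEMODE_ENTER, SAFEMODE_GET }

    static void writeEnum(DataOutput out, Enum<?> value) throws IOException {
        out.writeUTF(value.getDeclaringClass().getName());  // e.g. "...$SafeModeAction"
        out.writeUTF(value.name());                         // e.g. "SAFEMODE_ENTER"
    }

    @SuppressWarnings({"unchecked", "rawtypes"})
    static Enum<?> readEnum(DataInput in) throws IOException, ClassNotFoundException {
        Class enumClass = Class.forName(in.readUTF());
        return Enum.valueOf(enumClass, in.readUTF());  // no readFields() needed
    }

    public static void main(String[] args) throws Exception {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        writeEnum(new DataOutputStream(buf), SafeModeAction.SAFEMODE_ENTER);
        Enum<?> back = readEnum(new DataInputStream(
                new ByteArrayInputStream(buf.toByteArray())));
        System.out.println(back);  // prints SAFEMODE_ENTER
    }
}
```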

          Doug Cutting added a comment -

          I just committed this. Thanks, Konstantin!


            People

            • Assignee:
              Konstantin Shvachko
              Reporter:
              Konstantin Shvachko
            • Votes: 0
              Watchers: 1
