
HBASE-2108: [HA] hbase cluster should be able to ride over hdfs 'safe mode' flip and namenode restart/move

    Details

    • Type: Sub-task
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: None
    • Labels: None

      Description

      Todd Lipcon wrote up the following speculation on what happens when NN is restarted/goes away/replaced by backup under hbase (see Dhruba's note here, http://hadoopblog.blogspot.com/2009/11/hdfs-high-availability.html, that Eli pointed us at for some background on the 0.21 BackupNode feature):

      "For regions that are already open, HBase can continue to serve reads so long as the regionservers are up and do not change state. This is because the HDFS client APIs cache the DFS block locations (a map of block ID -> datanode addresses) for open files.

      "If any HBase action occurs that causes the regionservers to reopen a region (eg a region server fails, load balancing rebalances the region assignment, or a compaction finishes) then the reopen will fail as the new file will not be able to access the NameNode to receive the block locations. As these are all periodic operations for HBase, it's impossible to put a specific bound on this time, but my guess is that at least one region server is likely to crash within less than a minute of a NameNode unavailability.

      "Similar properties hold for writes. HBase's writing behavior is limited to Commit Logs which are kept open by the region servers. Writes to commit logs that are already open will continue to succeed, since they only involve the datanodes, but if a region server rolls an edit log, the open() for the new log will fail if the NN is unavailable. There is currently some work going on in HBase trunk to preallocate open files for commit logs to avoid this issue, but it is not complete, and it is not a full solution for the issue. The other issue is that the close() call that completes the write of a commit log also depends on a functioning NameNode - if it is unavailable, the log will be left in an indeterminate state and the edits may become lost when the NN recovers.

      "The rolling of commit logs is triggered either when a timer elapses or when a certain amount of data has been written. Thus, this failure mode will trigger quickly when data is constantly being written to the cluster. If little data is being written, it still may trigger due to the automatic periodic log rolling.

      "Given these above failure modes, I don't believe there is an effective HA solution for HBase at this point. Although HBase may continue to operate for a short time period while a NN recovers, it is also possible that it will fail nearly immediately, depending on when HBase's periodic operations happen to occur. Even with an automatic failover like DRBD+Heartbeat on the NN, the downtime may last 5-10 minutes as the new NN must both replay the edit log and receive block reports from every datanode before it can exit safe mode. I believe this will cause most NN failovers to be accompanied by a partial or complete failure of the HBase cluster."

      The above makes sense to me. Let's fix it. Up to now our approach has been that if hdfs goes away, each regionserver deals with it individually by shutting itself down to protect against data loss. We need to handle riding over a NN restart or change of server.
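
      As a rough illustration of the preallocation idea Todd mentions, a log roll can avoid a NameNode open() if commit-log files are created ahead of time while the NN is healthy. The following is a hedged sketch only; the class, file-naming scheme, and pool size are hypothetical and are not HBase's actual WAL code.

      import java.io.IOException;
      import java.util.concurrent.ArrayBlockingQueue;
      import java.util.concurrent.BlockingQueue;

      import org.apache.hadoop.fs.FSDataOutputStream;
      import org.apache.hadoop.fs.FileSystem;
      import org.apache.hadoop.fs.Path;

      // Hypothetical sketch: keep a small pool of pre-created commit-log files
      // so that rolling to a new log does not require a NameNode round trip.
      public class PreallocatedLogPool {
        private final FileSystem fs;
        private final Path logDir;
        private final BlockingQueue<FSDataOutputStream> pool;
        private long seq = 0;

        public PreallocatedLogPool(FileSystem fs, Path logDir, int size) {
          this.fs = fs;
          this.logDir = logDir;
          this.pool = new ArrayBlockingQueue<FSDataOutputStream>(size);
        }

        // Called from a background thread while the NameNode is reachable.
        public synchronized void refill() throws IOException {
          while (pool.remainingCapacity() > 0) {
            Path next = new Path(logDir, "prealloc." + (seq++));
            FSDataOutputStream out = fs.create(next); // NameNode RPC happens here, ahead of need
            if (!pool.offer(out)) {                   // pool filled concurrently; discard the extra
              out.close();
              break;
            }
          }
        }

        // Called at log-roll time; succeeds even if the NameNode is briefly down,
        // as long as the pool is not empty.
        public synchronized FSDataOutputStream takeNextLog() throws IOException {
          FSDataOutputStream out = pool.poll();
          if (out == null) {
            // Pool exhausted; fall back to a direct create(), which needs the NN.
            out = fs.create(new Path(logDir, "log." + (seq++)));
          }
          return out;
        }
      }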


          Activity

          stack created issue -
          Andrew Purtell added a comment -

          "Even with an automatic failover like DRBD+Heartbeat on the NN, the downtime may last 5-10 minutes as the new NN must both replay the edit log and receive block reports from every datanode before it can exit safe mode. I believe this will cause most NN failovers to be accompanied by a partial or complete failure of the HBase cluster."

          I agree.

          We should be able to catch IOExceptions related to NN unavailability and handle them by deferring the work?

          Also, I can see a useful 0.20 HBase release which includes some backport of the fix for this issue. DRBD+Heartbeat is already used to fail over the 0.20 NameNode.
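
          A minimal sketch of the "catch and defer" idea above, assuming a hypothetical FsAction wrapper (illustrative only, not actual HBase code): retry the filesystem work with bounded backoff instead of aborting the regionserver.

          import java.io.IOException;

          // Hypothetical helper, not HBase code: runs a filesystem action and, on
          // IOException (e.g. NameNode unreachable or in safe mode), defers and
          // retries with exponential backoff until a deadline expires.
          public final class DeferringFsRunner {

            public interface FsAction<T> {
              T run() throws IOException;
            }

            public static <T> T runWithDeferral(FsAction<T> action, long maxWaitMs)
                throws IOException, InterruptedException {
              long deadline = System.currentTimeMillis() + maxWaitMs;
              long sleepMs = 1000;
              while (true) {
                try {
                  return action.run();
                } catch (IOException e) {
                  if (System.currentTimeMillis() + sleepMs > deadline) {
                    throw e;                                  // out of budget: surface the failure
                  }
                  Thread.sleep(sleepMs);                      // defer the work, then try again
                  sleepMs = Math.min(sleepMs * 2, 30 * 1000); // back off up to 30s between attempts
                }
              }
            }
          }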

          Andrew Purtell made changes -
          Link This issue relates to HBASE-2098 [ HBASE-2098 ]
          Andrew Purtell made changes -
          Assignee Andrew Purtell [ apurtell ]
          Andrew Purtell made changes -
          Parent HBASE-2183 [ 12455267 ]
          Issue Type Bug [ 1 ] → Sub-task [ 7 ]
          Andrew Purtell made changes -
          Link This issue incorporates HBASE-846 [ HBASE-846 ]
          Andrew Purtell made changes -
          Link This issue relates to HBASE-2098 [ HBASE-2098 ]
          dhruba borthakur added a comment -

          Hi folks, we are in the process of deploying some form of namenode HA via AvatarNode, details here:
          http://hadoopblog.blogspot.com/2010/02/hadoop-namenode-high-availability.html

          This is similar to where the Hadoop BackupNode might evolve in the future. NN failover using this method completes within a few seconds.

          Given the above, is anybody working on this issue? If not, may I work on it?

          Andrew Purtell added a comment -

          @Dhruba: Have at it. I was about to start an audit of all places in the master and regionserver where an IOException from the filesystem might cause an abort. Then from there, introduce some retry logic. Is this what you have in mind?
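
          One hedged sketch of where such retry logic could be centralized, rather than patching every call site found by the audit, is a wrapper filesystem. The class below is hypothetical and only illustrates the shape of the idea, shown here for create(), the call most likely to hit an unavailable NN:

          import java.io.IOException;

          import org.apache.hadoop.fs.FSDataOutputStream;
          import org.apache.hadoop.fs.FileSystem;
          import org.apache.hadoop.fs.FilterFileSystem;
          import org.apache.hadoop.fs.Path;

          // Hypothetical wrapper, not HBase code: retries selected operations so
          // callers in master/regionserver need not each handle NN unavailability.
          public class RetryingFileSystem extends FilterFileSystem {
            private final int maxAttempts;
            private final long sleepMs;

            public RetryingFileSystem(FileSystem fs, int maxAttempts, long sleepMs) {
              super(fs);
              this.maxAttempts = maxAttempts;
              this.sleepMs = sleepMs;
            }

            @Override
            public FSDataOutputStream create(Path f, boolean overwrite) throws IOException {
              IOException last = null;
              for (int attempt = 0; attempt < maxAttempts; attempt++) {
                try {
                  return super.create(f, overwrite);
                } catch (IOException e) {
                  last = e;                    // likely NN unreachable or in safe mode
                  try {
                    Thread.sleep(sleepMs);     // wait before the next attempt
                  } catch (InterruptedException ie) {
                    Thread.currentThread().interrupt();
                    throw new IOException("Interrupted while waiting to retry", ie);
                  }
                }
              }
              throw last;
            }
          }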

          Andrew Purtell made changes -
          Assignee Andrew Purtell [ apurtell ]
          Andrew Purtell made changes -
          Link This issue relates to HBASE-1964 [ HBASE-1964 ]
          dhruba borthakur added a comment -

          Thanks Andrew.

          I am thinking of making a few hdfs client parameters configurable (e.g. the number of block locations cached by the DFS client, maybe pre-allocating hdfs files, etc). HBase could then set/tune these parameters to survive namenode unavailability for up to a minute or so. If tuned correctly, HBase code should not even encounter any hdfs-related exception when the NN failover occurs. Do you think this is a feasible approach?
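
          For illustration, this is roughly what such tuning could look like from HBase's side, assuming knobs of this kind existed; the property names below are invented placeholders, not real HDFS configuration keys:

          import org.apache.hadoop.conf.Configuration;

          // Illustrative only: the keys below are hypothetical placeholders for the
          // kind of DFS-client knobs being proposed, not real HDFS configuration.
          public class DfsClientTuning {
            public static Configuration tuneForNamenodeFailover(Configuration conf) {
              // Hypothetical: cache more block locations per open file in the DFS client.
              conf.setInt("dfs.client.example.cached.block.locations", 1024);
              // Hypothetical: keep retrying NameNode RPCs for roughly a minute.
              conf.setInt("dfs.client.example.namenode.retry.max.wait.ms", 60 * 1000);
              // Hypothetical: pre-allocate files so commit-log rolls avoid a NameNode open().
              conf.setBoolean("dfs.client.example.preallocate.files", true);
              return conf;
            }
          }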

          Andrew Purtell added a comment -

          My internal customers will want to be able to survive a 10 minute outage (or longer, but we need to set a reasonable expectation). Switching to some degraded operational mode would be acceptable. Perhaps you should open an HDFS jira to do what you propose and we can link it to this issue? They are related but have different aims, and both are worth doing.

          dhruba borthakur made changes -
          Link This issue is blocked by HDFS-1108 [ HDFS-1108 ]
          dhruba borthakur made changes -
          Link This issue is related to HDFS-976 [ HDFS-976 ]
          stack added a comment -

          Moved from 0.21 to 0.22 just after merge of old 0.20 branch into TRUNK.

          stack made changes -
          Fix Version/s 0.22.0 [ 12314223 ]
          Fix Version/s 0.21.0 [ 12313607 ]
          stack added a comment -

          Moving out of 0.92.0. Pull it back in if you think different.

          stack made changes -
          Fix Version/s 0.92.0 [ 12314223 ]
          Nicolas Liochon made changes -
          Link This issue relates to HBASE-5843 [ HBASE-5843 ]
          Andrew Purtell added a comment -

          Superseded by HBASE-8338 and related.

          Andrew Purtell made changes -
          Status Open [ 1 ] → Resolved [ 5 ]
          Resolution Fixed [ 1 ]

            People

            • Assignee: Unassigned
            • Reporter: stack
            • Votes: 0
            • Watchers: 8
