HBase / HBASE-4177

Handling read failures during recovery - when the HMaster calls Namenode recovery, the recovery may fail, leading to read failures while splitting logs

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Critical
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 0.95.1
    • Component/s: master
    • Labels:
      None

      Description

      As per the mailing thread with the heading
      'Handling read failures during recovery' we found this problem.
      As part of splitting logs, the HMaster calls Namenode recovery. The recovery is an asynchronous process.
      In HDFS
      =======
      Even though the client gets the updated block info from the Namenode on the first
      read failure, the client discards the new info and keeps using the old info
      to retrieve the data from the datanode. So all the read
      retries fail. [The method parameter is reassigned inside the call, so the new
      value is never reflected in the caller; see the sketch below.]
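      A minimal, hypothetical Java sketch of that reassignment pitfall (the names below are
      illustrative, not the actual DFSClient code): reassigning a method parameter only rebinds
      the local copy, so the caller keeps retrying with the stale block locations.

          class BlockInfo {
            final String location;
            BlockInfo(String location) { this.location = location; }
          }

          class ReadRetryExample {
            // The refreshed info is assigned to the parameter only; the caller never sees it.
            static void refreshOnFailure(BlockInfo blocks) {
              blocks = new BlockInfo("updated-datanode"); // lost when the method returns
            }

            public static void main(String[] args) {
              BlockInfo blocks = new BlockInfo("stale-datanode");
              refreshOnFailure(blocks);
              // Still prints "stale-datanode": every retry reuses the old locations.
              System.out.println(blocks.location);
            }
          }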
      In HBASE
      =======
      In the HMaster code we wait for 1 second. But if the recovery fails, the log split may not happen, which can lead to data loss.
      So we may need to decide on the actual delay to introduce once the HMaster calls NN recovery (see the sketch below).
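      One possible direction, sketched here with assumed timeout parameters (this is not the
      actual HMaster code), is to poll DistributedFileSystem.recoverLease(), which returns true
      only once recovery has completed, instead of sleeping for a fixed second:

          import java.io.IOException;
          import org.apache.hadoop.fs.Path;
          import org.apache.hadoop.hdfs.DistributedFileSystem;

          class LeaseRecoveryWait {
            // Polls recoverLease() until it reports the file as closed, or the deadline passes.
            static boolean waitForLeaseRecovery(DistributedFileSystem dfs, Path logFile,
                long timeoutMs, long pollMs) throws IOException, InterruptedException {
              long deadline = System.currentTimeMillis() + timeoutMs;
              while (!dfs.recoverLease(logFile)) {
                if (System.currentTimeMillis() > deadline) {
                  return false; // caller decides whether to retry or fail the log split
                }
                Thread.sleep(pollMs);
              }
              return true;
            }
          }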

        Issue Links

          Activity

          stack made changes -
          Status Resolved [ 5 ] Closed [ 6 ]
          Nicolas Liochon made changes -
          Status Open [ 1 ] Resolved [ 5 ]
          Fix Version/s 0.95.1 [ 12324288 ]
          Resolution Fixed [ 1 ]
          Nicolas Liochon added a comment -

          To me, this was fixed when we made the recoverLease synchronous. Please reopen if I'm wrong.

          ramkrishna.s.vasudevan added a comment -

          @N
          I too think the problem is still there, but internally we have not started working on this yet. At the time we discussed that some changes are needed on the HDFS side as well, and Stack has already raised that in an HDFS JIRA. You are welcome to take a stab at it, N.

          Nicolas Liochon made changes -
          Link This issue relates to HBASE-5843 [ HBASE-5843 ]
          Nicolas Liochon added a comment -

          Hum, it's really close to what I've done, but this problem may still be there. Ram, what do you think? If you don't have the time, I can give it a try.

          Lars Hofhansl added a comment -

          That is superseded by all of N's work, correct?

          ramkrishna.s.vasudevan added a comment -

          Any suggestions on this? We tend to run into this problem every now and then.

          ramkrishna.s.vasudevan added a comment -

          @Stack
          Thanks for tracking this and raising an issue for the same in HDFS.

          stack added a comment -

          I created HDFS-2296 at Hairong's suggestion.

          stack made changes -
          Field Original Value New Value
          Priority Major [ 3 ] Critical [ 2 ]
          Ted Yu added a comment -

          Looking at FSUtils.recoverFileLease(), we check the type of fs inside the while loop. This is unnecessary.

          w.r.t. soft limit for the lease, we have:

              if (waitedFor > FSConstants.LEASE_SOFTLIMIT_PERIOD) {
                LOG.warn("Waited " + waitedFor + "ms for lease recovery on " + p +
                  ":" + e.getMessage());
              }

          I think we should wait for the remainder of the soft limit (which is 60 seconds); a rough sketch of both points follows.
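          A rough sketch of both suggestions (a hypothetical method shape, not the actual
          FSUtils patch): check the filesystem type once before the loop, and bound the
          retries by the lease soft limit instead of looping indefinitely.

              import java.io.IOException;
              import org.apache.hadoop.fs.FileSystem;
              import org.apache.hadoop.fs.Path;
              import org.apache.hadoop.hdfs.DistributedFileSystem;

              class RecoverFileLeaseSketch {
                // 60s soft limit; the real constant lives on the HDFS side.
                static final long LEASE_SOFTLIMIT_PERIOD = 60 * 1000L;

                static void recoverFileLease(FileSystem fs, Path p) throws IOException {
                  // Check the type of fs once, outside the loop.
                  if (!(fs instanceof DistributedFileSystem)) {
                    return;
                  }
                  DistributedFileSystem dfs = (DistributedFileSystem) fs;
                  long start = System.currentTimeMillis();
                  boolean recovered = false;
                  while (!recovered) {
                    recovered = dfs.recoverLease(p);
                    long waitedFor = System.currentTimeMillis() - start;
                    if (!recovered) {
                      if (waitedFor > LEASE_SOFTLIMIT_PERIOD) {
                        // Give up once the soft limit has been waited out.
                        throw new IOException("Waited " + waitedFor + "ms for lease recovery on " + p);
                      }
                      try {
                        Thread.sleep(1000);
                      } catch (InterruptedException ie) {
                        Thread.currentThread().interrupt();
                        throw new IOException("Interrupted waiting for lease recovery on " + p);
                      }
                    }
                  }
                }
              }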

          ramkrishna.s.vasudevan created issue -

            People

            • Assignee: ramkrishna.s.vasudevan
            • Reporter: ramkrishna.s.vasudevan
            • Votes: 0
            • Watchers: 10
