Uploaded image for project: 'HBase'
  1. HBase
  2. HBASE-13732

TestHBaseFsck#testParallelWithRetriesHbck fails intermittently

    XMLWordPrintableJSON

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: 1.1.0, 1.2.0, 2.0.0
    • Fix Version/s: 1.2.0, 1.1.1, 2.0.0
    • Component/s: hbck, test
    • Labels:
      None

      Description

      TestHBaseFsck#testParallelWithRetriesHbck failed intermittently (especially in Windows environment) with "java.io.IOException: Duplicate hbck - Abort"

      java.util.concurrent.ExecutionException: java.io.IOException: Duplicate hbck - Abort
      	at java.util.concurrent.FutureTask$Sync.innerGet(FutureTask.java:252)
      	at java.util.concurrent.FutureTask.get(FutureTask.java:111)
      	at org.apache.hadoop.hbase.util.TestHBaseFsck.testParallelWithRetriesHbck(TestHBaseFsck.java:644)
      Caused by: java.io.IOException: Duplicate hbck - Abort
      	at org.apache.hadoop.hbase.util.HBaseFsck.connect(HBaseFsck.java:484)
      	at org.apache.hadoop.hbase.util.hbck.HbckTestingUtil.doFsck(HbckTestingUtil.java:53)
      	at org.apache.hadoop.hbase.util.hbck.HbckTestingUtil.doFsck(HbckTestingUtil.java:43)
      	at org.apache.hadoop.hbase.util.hbck.HbckTestingUtil.doFsck(HbckTestingUtil.java:38)
      	at org.apache.hadoop.hbase.util.TestHBaseFsck$2RunHbck.call(TestHBaseFsck.java:635)
      	at org.apache.hadoop.hbase.util.TestHBaseFsck$2RunHbck.call(TestHBaseFsck.java:628)
      	at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334)
      	at java.util.concurrent.FutureTask.run(FutureTask.java:166)
      	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
      	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
      	at java.lang.Thread.run(Thread.java:722)
      

      HBASE-13591 tried to address this issue. It did improve the pass rate in Linux environment (after the fix, I could not repro in my machine); however, the test still failed intermittently in Windows environment during testing of 1.1 release.

      Looking at the code, it uses the ExponentialBackoffPolicy (starting with 200ms sleep time after first failed attempt to acquire the lock in ZK, then 400ms, then 800ms, etc.) in between retries. Therefore, even the first hbck run completes, the second hbck run would still fail due to long sleep time.

      the proposal to fix the problem is to use ExponentialBackoffPolicyWithLimit and cap the max sleep time to some small number (eg. 5 seconds, it should be configurable). This would make the test more robust.

        Attachments

        1. HBASE-13732-addendum-master.patch
          2 kB
          Stephen Yuan Jiang
        2. HBASE-13732-addendum-branch-1.patch
          3 kB
          Stephen Yuan Jiang
        3. HBASE-13732.patch
          8 kB
          Stephen Yuan Jiang

          Issue Links

            Activity

              People

              • Assignee:
                syuanjiang Stephen Yuan Jiang
                Reporter:
                syuanjiang Stephen Yuan Jiang
              • Votes:
                0 Vote for this issue
                Watchers:
                6 Start watching this issue

                Dates

                • Created:
                  Updated:
                  Resolved: