Uploaded image for project: 'Hadoop HDFS'
  1. Hadoop HDFS
  2. HDFS-13179

TestLazyPersistReplicaRecovery#testDnRestartWithSavedReplicas fails intermittently

    XMLWordPrintableJSON

Details

    • Bug
    • Status: Resolved
    • Critical
    • Resolution: Fixed
    • 3.0.0
    • 3.0.4, 3.3.0, 3.1.4, 3.2.2, 2.10.1
    • fs
    • None
    • Reviewed

    Description

      The error caused by TimeoutException because the test is waiting to ensure that the file is replicated to DISK storage but the replication can't be finished to DISK during the 30s timeout in ensureFileReplicasOnStorageType(), but the file is still on RAM_DISK - so there is no data loss.

      Adding the following to TestLazyPersistReplicaRecovery.java:56 essentially fixes the flakiness.

          try {
            ensureFileReplicasOnStorageType(path1, DEFAULT);
          }catch (TimeoutException t){
            LOG.warn("We got \"" + t.getMessage() + "\" so trying to find data on RAM_DISK");
            ensureFileReplicasOnStorageType(path1, RAM_DISK);
          }
        }
      

      Some thoughts:

      • Successful and failed tests run similar to the point when datanode restarts. Restart line is the following in the log: LazyPersistTestCase - Restarting the DataNode
      • There is a line which only occurs in the failed test: addStoredBlock: Redundant addStoredBlock request received for blk_1073741825_1001 on node 127.0.0.1:49455 size 5242880
      • This redundant request at BlockManager#addStoredBlock could be the main reason for the test fail. Something wrong with the gen stamp? Corrupt replicas?

      =============================

      Current fail ratio based on my test of TestLazyPersistReplicaRecovery:
      1000 runs, 34 failures (3.4% fail)

      Failure rate analysis:
      TestLazyPersistReplicaRecovery.testDnRestartWithSavedReplicas: 3.4%

      33 failures caused by:

      java.util.concurrent.TimeoutException: Timed out waiting for condition. Thread diagnostics: Timestamp: 2018-01-05 11:50:34,964 "IPC Server handler 6 on 39589" 
      

      1 failure caused by:

      java.net.BindException: Problem binding to [localhost:56729] java.net.BindException: Address already in use; For more details see: http://wiki.apache.org/hadoop/BindException at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.TestLazyPersistReplicaRecovery.testDnRestartWithSavedReplicas(TestLazyPersistReplicaRecovery.java:49) Caused by: java.net.BindException: Address already in use at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.TestLazyPersistReplicaRecovery.testDnRestartWithSavedReplicas(TestLazyPersistReplicaRecovery.java:49)
      

      =============================

      Example stacktrace:

      Timed out waiting for condition. Thread diagnostics:
      Timestamp: 2017-11-01 10:36:49,499
      "Thread-1" prio=5 tid=13 runnable
      java.lang.Thread.State: RUNNABLE
      at java.lang.Thread.dumpThreads(Native Method)
      at java.lang.Thread.getAllStackTraces(Thread.java:1610)
      at org.apache.hadoop.test.TimedOutTestsListener.buildThreadDump(TimedOutTestsListener.java:87)
      at org.apache.hadoop.test.TimedOutTestsListener.buildThreadDiagnosticString(TimedOutTestsListener.java:73)
      at org.apache.hadoop.test.GenericTestUtils.waitFor(GenericTestUtils.java:369)
      at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.LazyPersistTestCase.ensureFileReplicasOnStorageType(LazyPersistTestCase.java:140)
      at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.TestLazyPersistReplicaRecovery.testDnRestartWithSavedReplicas(TestLazyPersistReplicaRecovery.java:54)
      at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
      at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
      at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
      at java.lang.reflect.Method.invoke(Method.java:498)
      ...
      

      Attachments

        1. HDFS-13179.001.patch
          6 kB
          Ahmed Hussein
        2. HDFS-13179.002.patch
          4 kB
          Ahmed Hussein
        3. HDFS-13179.003.patch
          5 kB
          Ahmed Hussein
        4. HDFS-13179-branch-2.10.003.patch
          6 kB
          Ahmed Hussein
        5. HDFS-13179-branch-3.0.003.patch
          5 kB
          Stephen O'Donnell
        6. test runs.zip
          52 kB
          Gabor Bota

        Activity

          People

            ahussein Ahmed Hussein
            gabor.bota Gabor Bota
            Votes:
            0 Vote for this issue
            Watchers:
            13 Start watching this issue

            Dates

              Created:
              Updated:
              Resolved: