Hadoop HDFS
  HDFS-2433

TestFileAppend4 fails intermittently

    Details

    • Type: Bug
    • Status: Open
    • Priority: Critical
    • Resolution: Unresolved
    • Affects Version/s: 0.20.205.0, 1.0.0
    • Fix Version/s: None
    • Component/s: datanode, namenode, test
    • Labels:
      None
    • Target Version/s:

      Description

      A Jenkins build we have running failed twice in a row with issues from TestFileAppend4.testAppendSyncReplication1. In an attempt to reproduce the error, I ran TestFileAppend4 in a loop overnight, saving the results away. (No clean was done in between test runs.)

      When TestFileAppend4 is run in a loop, the testAppendSyncReplication[012] tests fail about 10% of the time (14 times out of 130 tries). They all fail with something like the following. Often it is only one of the tests that fails, but I have seen as many as two fail in one run.

      Testcase: testAppendSyncReplication2 took 32.198 sec
              FAILED
      Should have 2 replicas for that block, not 1
      junit.framework.AssertionFailedError: Should have 2 replicas for that block, not 1
              at org.apache.hadoop.hdfs.TestFileAppend4.replicationTest(TestFileAppend4.java:477)
              at org.apache.hadoop.hdfs.TestFileAppend4.testAppendSyncReplication2(TestFileAppend4.java:425)
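
      For reference, the failing check boils down to counting the replica locations the namenode reports for the file's blocks. A minimal sketch of such a check (illustrative only, not the actual TestFileAppend4 code; the class and method names here are made up) against the public FileSystem API might look like:

      import java.io.IOException;

      import org.apache.hadoop.fs.BlockLocation;
      import org.apache.hadoop.fs.FileStatus;
      import org.apache.hadoop.fs.FileSystem;
      import org.apache.hadoop.fs.Path;

      // Illustrative helper only, not the test's real replicationTest(): ask the
      // namenode for the block locations of a file and fail if any block has the
      // wrong number of replica hosts.
      public class ReplicaCountCheckSketch {
        static void assertReplicaCount(FileSystem fs, Path file, int expected) throws IOException {
          FileStatus stat = fs.getFileStatus(file);
          for (BlockLocation block : fs.getFileBlockLocations(stat, 0, stat.getLen())) {
            int replicas = block.getHosts().length;
            if (replicas != expected) {
              throw new AssertionError(
                  "Should have " + expected + " replicas for that block, not " + replicas);
            }
          }
        }
      }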
      

      I also saw several other tests that are part of TestFileAppend4 fail during this experiment. They may all be related to one another, so I am filing them in the same JIRA. If it turns out that they are not related then they can be split up later.

      testAppendSyncBlockPlusBbw failed 6 out of the 130 times or about 5% of the time

      Testcase: testAppendSyncBlockPlusBbw took 1.633 sec
              FAILED
      unexpected file size! received=0 , expected=1024
      junit.framework.AssertionFailedError: unexpected file size! received=0 , expected=1024
              at org.apache.hadoop.hdfs.TestFileAppend4.assertFileSize(TestFileAppend4.java:136)
              at org.apache.hadoop.hdfs.TestFileAppend4.testAppendSyncBlockPlusBbw(TestFileAppend4.java:401)
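
      The size check in this case is more basic: compare the length the namenode reports for the file with the number of bytes written. A minimal sketch of such a check (again illustrative only, not the test's real assertFileSize helper; same imports as the sketch above) could be:

      // Illustrative only. Note that for a file that was appended and synced but not
      // yet closed, the length visible through getFileStatus() may lag behind the
      // bytes actually written, which is one way a received=0 result could appear.
      static void checkFileSize(FileSystem fs, Path file, long expected) throws IOException {
        long received = fs.getFileStatus(file).getLen();
        if (received != expected) {
          throw new AssertionError(
              "unexpected file size! received=" + received + " , expected=" + expected);
        }
      }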
      

      testAppendSyncChecksum[012] failed 2 out of the 130 times or about 1.5% of the time

      Testcase: testAppendSyncChecksum1 took 32.385 sec
              FAILED
      Should have 1 replica for that block, not 2
      junit.framework.AssertionFailedError: Should have 1 replica for that block, not 2
              at org.apache.hadoop.hdfs.TestFileAppend4.checksumTest(TestFileAppend4.java:556)
              at org.apache.hadoop.hdfs.TestFileAppend4.testAppendSyncChecksum1(TestFileAppend4.java:500)
      

      I will attach logs for all of the failures. Be aware that I did change some of the logging messages in this test so I could better see when testAppendSyncReplication started and ended. Other than that, the code is stock 0.20.205 RC2.

      1. failed.tar.bz2 (3.03 MB) - Robert Joseph Evans

        Activity

        Matt Foley added a comment -

        Changed Target Version to 1.3.0 upon release of 1.2.0. Please change to 1.2.1 if you intend to submit a fix for branch-1.2.

        Robert Joseph Evans added a comment -

        That could be true, but it does not seem to fit.

        testAppendSyncReplication[012] brings up a mini DFS cluster, writes some data to a file, kills off one of the datanodes, and then appends some more data to the file. It then closes the file, shuts down the cluster, brings up a new cluster in safe mode, and tries to verify that the expected number of replicas are there.

        It is showing too few replicas. It could be that there is a race condition on startup of the second minicluster where not all of the datanodes have finished their full block reports before we ask how many replicas there are.
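
        For context, here is a rough sketch of that flow with a polling wait inserted before the replica check, to show where the suspected block-report race could be closed. This is not the actual TestFileAppend4 code: the MiniDFSCluster calls and the dfs.support.append key are written from memory of the branch-0.20 test setup, the safe-mode startup step is omitted, and names like AppendSyncReplicationSketch are made up.

        import org.apache.hadoop.conf.Configuration;
        import org.apache.hadoop.fs.BlockLocation;
        import org.apache.hadoop.fs.FSDataOutputStream;
        import org.apache.hadoop.fs.FileStatus;
        import org.apache.hadoop.fs.FileSystem;
        import org.apache.hadoop.fs.Path;
        import org.apache.hadoop.hdfs.MiniDFSCluster;

        public class AppendSyncReplicationSketch {
          public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            conf.setBoolean("dfs.support.append", true);  // assumed needed for append() on branch-0.20

            // First cluster: write some data, kill a datanode, append, close.
            MiniDFSCluster cluster = new MiniDFSCluster(conf, 3, true, null);
            cluster.waitActive();
            FileSystem fs = cluster.getFileSystem();
            Path file = new Path("/appendSyncReplication");

            FSDataOutputStream out = fs.create(file, (short) 3);
            out.write(new byte[1024]);
            out.close();

            cluster.stopDataNode(0);            // kill one of the datanodes

            out = fs.append(file);
            out.write(new byte[1024]);
            out.close();
            cluster.shutdown();

            // Second cluster over the same storage; this is where the replica check runs.
            cluster = new MiniDFSCluster(conf, 3, false, null);
            cluster.waitActive();
            fs = cluster.getFileSystem();

            // Instead of asserting immediately, poll until the datanodes' block reports
            // have been processed or a timeout expires; asserting right away is where
            // the suspected race would bite.
            FileStatus stat = fs.getFileStatus(file);
            int replicas = 0;
            for (int attempt = 0; attempt < 30; attempt++) {
              BlockLocation[] blocks = fs.getFileBlockLocations(stat, 0, stat.getLen());
              replicas = (blocks.length == 0) ? 0 : blocks[0].getHosts().length;
              if (replicas >= 2) {
                break;
              }
              Thread.sleep(1000);
            }
            if (replicas != 2) {
              throw new AssertionError("Should have 2 replicas for that block, not " + replicas);
            }
            cluster.shutdown();
          }
        }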

        Todd Lipcon added a comment -

        my hunch is that this is related to HDFS-1172.

        Robert Joseph Evans added a comment -

        The complete set of logs for all of the failures. (It is rather large)


          People

          • Assignee:
            Unassigned
          • Reporter:
            Robert Joseph Evans
          • Votes:
            0
          • Watchers:
            6
