Hadoop HDFS
  1. Hadoop HDFS
  2. HDFS-3530

TestFileAppend2.testComplexAppend occasionally fails

    Details

    • Type: Bug Bug
    • Status: Open
    • Priority: Major Major
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: test
    • Labels:
      None

      Description

      TestFileAppend2.testComplexAppend occasionally fails with the following:

      junit.framework.AssertionFailedError: testComplexAppend Worker encountered exceptions.
      at junit.framework.Assert.fail(Assert.java:47)
      at junit.framework.Assert.assertTrue(Assert.java:20)
      at org.apache.hadoop.hdfs.TestFileAppend2.testComplexAppend(TestFileAppend2.java:385)

      1. HDFS-3530.txt
        3 kB
        Tomohiko Kinebuchi
      2. HDFS-3530-for-debug.txt
        2 kB
        Tomohiko Kinebuchi
      3. PreCommit-HADOOP-Build #1116 test - testComplexAppend.html.gz
        881 kB
        Eli Collins

        Activity

        Hide
        Eli Collins added a comment -

        Full test log attached.

        Show
        Eli Collins added a comment - Full test log attached.
        Hide
        Sho Shimauchi added a comment -

        The following log appears in the Jenkins log:

        2012-06-12 23:58:17,839 INFO namenode.FileJournalManager (FileJournalManager.java:finalizeLogSegment(100)) - Finalizing edits file /home/jenkins/jenkins-slave/workspace/PreCommit-HADOOP-Build/trunk/hadoop-hdfs-project/hadoop-hdfs/target/test/data/dfs/name1/current/edits_inprogress_0000000000000000001 -> /home/jenkins/jenkins-slave/workspace/PreCommit-HADOOP-Build/trunk/hadoop-hdfs-project/hadoop-hdfs/target/test/data/dfs/name1/current/edits_0000000000000000001-0000000000000001003
        2012-06-12 23:58:17,839 ERROR namenode.NNStorage (NNStorage.java:reportErrorsOnDirectory(888)) - Error reported on storage directory Storage Directory /home/jenkins/jenkins-slave/workspace/PreCommit-HADOOP-Build/trunk/hadoop-hdfs-project/hadoop-hdfs/target/test/data/dfs/name1
        2012-06-12 23:58:17,840 WARN namenode.NNStorage (NNStorage.java:reportErrorsOnDirectory(893)) - About to remove corresponding storage: /home/jenkins/jenkins-slave/workspace/PreCommit-HADOOP-Build/trunk/hadoop-hdfs-project/hadoop-hdfs/target/test/data/dfs/name1
        2012-06-12 23:58:17,840 ERROR namenode.FSEditLog (JournalSet.java:mapJournalsAndReportErrors(341)) - Error: finalize log segment 1, 1003 failed for (journal JournalAndStream(mgr=FileJournalManager(root=/home/jenkins/jenkins-slave/workspace/PreCommit-HADOOP-Build/trunk/hadoop-hdfs-project/hadoop-hdfs/target/test/data/dfs/name1), stream=null))
        java.lang.IllegalStateException: Unable to finalize edits file /home/jenkins/jenkins-slave/workspace/PreCommit-HADOOP-Build/trunk/hadoop-hdfs-project/hadoop-hdfs/target/test/data/dfs/name1/current/edits_inprogress_0000000000000000001
        at org.apache.hadoop.hdfs.server.namenode.FileJournalManager.finalizeLogSegment(FileJournalManager.java:107)
        at org.apache.hadoop.hdfs.server.namenode.JournalSet$3.apply(JournalSet.java:198)
        at org.apache.hadoop.hdfs.server.namenode.JournalSet.mapJournalsAndReportErrors(JournalSet.java:323)
        at org.apache.hadoop.hdfs.server.namenode.JournalSet.finalizeLogSegment(JournalSet.java:193)
        at org.apache.hadoop.hdfs.server.namenode.FSEditLog.endCurrentLogSegment(FSEditLog.java:980)
        at org.apache.hadoop.hdfs.server.namenode.FSEditLog.close(FSEditLog.java:324)
        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.stopActiveServices(FSNamesystem.java:676)
        at org.apache.hadoop.hdfs.server.namenode.NameNode$NameNodeHAContext.stopActiveServices(NameNode.java:1325)
        at org.apache.hadoop.hdfs.server.namenode.ha.ActiveState.exitState(ActiveState.java:70)
        at org.apache.hadoop.hdfs.server.namenode.NameNode.stop(NameNode.java:630)
        at org.apache.hadoop.hdfs.MiniDFSCluster.shutdown(MiniDFSCluster.java:1309)
        at org.apache.hadoop.hdfs.TestFileAppend2.testComplexAppend(TestFileAppend2.java:379)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
        at java.lang.reflect.Method.invoke(Method.java:597)
        at junit.framework.TestCase.runTest(TestCase.java:168)
        at junit.framework.TestCase.runBare(TestCase.java:134)
        at junit.framework.TestResult$1.protect(TestResult.java:110)
        at junit.framework.TestResult.runProtected(TestResult.java:128)
        at junit.framework.TestResult.run(TestResult.java:113)
        at junit.framework.TestCase.run(TestCase.java:124)
        at junit.framework.TestSuite.runTest(TestSuite.java:243)
        at junit.framework.TestSuite.run(TestSuite.java:238)
        at org.junit.internal.runners.JUnit38ClassRunner.run(JUnit38ClassRunner.java:83)
        at org.apache.maven.surefire.junit4.JUnit4Provider.execute(JUnit4Provider.java:236)
        at org.apache.maven.surefire.junit4.JUnit4Provider.executeTestSet(JUnit4Provider.java:134)
        at org.apache.maven.surefire.junit4.JUnit4Provider.invoke(JUnit4Provider.java:113)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
        at java.lang.reflect.Method.invoke(Method.java:597)
        at org.apache.maven.surefire.util.ReflectionUtils.invokeMethodWithArray(ReflectionUtils.java:189)
        at org.apache.maven.surefire.booter.ProviderFactory$ProviderProxy.invoke(ProviderFactory.java:165)
        at org.apache.maven.surefire.booter.ProviderFactory.invokeProvider(ProviderFactory.java:85)
        at org.apache.maven.surefire.booter.ForkedBooter.runSuitesInProcess(ForkedBooter.java:103)
        at org.apache.maven.surefire.booter.ForkedBooter.main(ForkedBooter.java:74)

        This exception looks similar to HDFS-2939.
        FileJournalManager.finalizeLogSegment() failed to rename an inprogress edits file to finalize file.

          @Override
          synchronized public void finalizeLogSegment(long firstTxId, long lastTxId)
              throws IOException {
            File inprogressFile = NNStorage.getInProgressEditsFile(sd, firstTxId);
        
            File dstFile = NNStorage.getFinalizedEditsFile(
                sd, firstTxId, lastTxId);
            LOG.info("Finalizing edits file " + inprogressFile + " -> " + dstFile);
            
            Preconditions.checkState(!dstFile.exists(),
                "Can't finalize edits file " + inprogressFile + " since finalized file " +
                "already exists");
            if (!inprogressFile.renameTo(dstFile)) {
              storage.reportErrorsOnDirectory(sd);
              throw new IllegalStateException("Unable to finalize edits file " + inprogressFile);
            }
            if (inprogressFile.equals(currentInProgress)) {
              currentInProgress = null;
            }
          }
        

        One of the probable cause of renaming failure is the inprogress file was not closed.
        It is closed just before finalizedLogSegment() in JournalSet.JournalClosure.apply(), but the code block is not synchronized.

          @Override
          public void finalizeLogSegment(final long firstTxId, final long lastTxId)
              throws IOException {
            mapJournalsAndReportErrors(new JournalClosure() {
              @Override
              public void apply(JournalAndStream jas) throws IOException {
                if (jas.isActive()) {
                  jas.closeStream();
                  jas.getManager().finalizeLogSegment(firstTxId, lastTxId);
                }
              }
            }, "finalize log segment " + firstTxId + ", " + lastTxId);
          }
        

        So I guess some race condition happens in the above part.
        Is it ok to just enclose jas.closeStream() and jas.getManager().finalizeLogSegment() in synchronized block?

        Show
        Sho Shimauchi added a comment - The following log appears in the Jenkins log: 2012-06-12 23:58:17,839 INFO namenode.FileJournalManager (FileJournalManager.java:finalizeLogSegment(100)) - Finalizing edits file /home/jenkins/jenkins-slave/workspace/PreCommit-HADOOP-Build/trunk/hadoop-hdfs-project/hadoop-hdfs/target/test/data/dfs/name1/current/edits_inprogress_0000000000000000001 -> /home/jenkins/jenkins-slave/workspace/PreCommit-HADOOP-Build/trunk/hadoop-hdfs-project/hadoop-hdfs/target/test/data/dfs/name1/current/edits_0000000000000000001-0000000000000001003 2012-06-12 23:58:17,839 ERROR namenode.NNStorage (NNStorage.java:reportErrorsOnDirectory(888)) - Error reported on storage directory Storage Directory /home/jenkins/jenkins-slave/workspace/PreCommit-HADOOP-Build/trunk/hadoop-hdfs-project/hadoop-hdfs/target/test/data/dfs/name1 2012-06-12 23:58:17,840 WARN namenode.NNStorage (NNStorage.java:reportErrorsOnDirectory(893)) - About to remove corresponding storage: /home/jenkins/jenkins-slave/workspace/PreCommit-HADOOP-Build/trunk/hadoop-hdfs-project/hadoop-hdfs/target/test/data/dfs/name1 2012-06-12 23:58:17,840 ERROR namenode.FSEditLog (JournalSet.java:mapJournalsAndReportErrors(341)) - Error: finalize log segment 1, 1003 failed for (journal JournalAndStream(mgr=FileJournalManager(root=/home/jenkins/jenkins-slave/workspace/PreCommit-HADOOP-Build/trunk/hadoop-hdfs-project/hadoop-hdfs/target/test/data/dfs/name1), stream=null)) java.lang.IllegalStateException: Unable to finalize edits file /home/jenkins/jenkins-slave/workspace/PreCommit-HADOOP-Build/trunk/hadoop-hdfs-project/hadoop-hdfs/target/test/data/dfs/name1/current/edits_inprogress_0000000000000000001 at org.apache.hadoop.hdfs.server.namenode.FileJournalManager.finalizeLogSegment(FileJournalManager.java:107) at org.apache.hadoop.hdfs.server.namenode.JournalSet$3.apply(JournalSet.java:198) at org.apache.hadoop.hdfs.server.namenode.JournalSet.mapJournalsAndReportErrors(JournalSet.java:323) at org.apache.hadoop.hdfs.server.namenode.JournalSet.finalizeLogSegment(JournalSet.java:193) at org.apache.hadoop.hdfs.server.namenode.FSEditLog.endCurrentLogSegment(FSEditLog.java:980) at org.apache.hadoop.hdfs.server.namenode.FSEditLog.close(FSEditLog.java:324) at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.stopActiveServices(FSNamesystem.java:676) at org.apache.hadoop.hdfs.server.namenode.NameNode$NameNodeHAContext.stopActiveServices(NameNode.java:1325) at org.apache.hadoop.hdfs.server.namenode.ha.ActiveState.exitState(ActiveState.java:70) at org.apache.hadoop.hdfs.server.namenode.NameNode.stop(NameNode.java:630) at org.apache.hadoop.hdfs.MiniDFSCluster.shutdown(MiniDFSCluster.java:1309) at org.apache.hadoop.hdfs.TestFileAppend2.testComplexAppend(TestFileAppend2.java:379) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) at java.lang.reflect.Method.invoke(Method.java:597) at junit.framework.TestCase.runTest(TestCase.java:168) at junit.framework.TestCase.runBare(TestCase.java:134) at junit.framework.TestResult$1.protect(TestResult.java:110) at junit.framework.TestResult.runProtected(TestResult.java:128) at junit.framework.TestResult.run(TestResult.java:113) at junit.framework.TestCase.run(TestCase.java:124) at junit.framework.TestSuite.runTest(TestSuite.java:243) at junit.framework.TestSuite.run(TestSuite.java:238) at org.junit.internal.runners.JUnit38ClassRunner.run(JUnit38ClassRunner.java:83) at org.apache.maven.surefire.junit4.JUnit4Provider.execute(JUnit4Provider.java:236) at org.apache.maven.surefire.junit4.JUnit4Provider.executeTestSet(JUnit4Provider.java:134) at org.apache.maven.surefire.junit4.JUnit4Provider.invoke(JUnit4Provider.java:113) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) at java.lang.reflect.Method.invoke(Method.java:597) at org.apache.maven.surefire.util.ReflectionUtils.invokeMethodWithArray(ReflectionUtils.java:189) at org.apache.maven.surefire.booter.ProviderFactory$ProviderProxy.invoke(ProviderFactory.java:165) at org.apache.maven.surefire.booter.ProviderFactory.invokeProvider(ProviderFactory.java:85) at org.apache.maven.surefire.booter.ForkedBooter.runSuitesInProcess(ForkedBooter.java:103) at org.apache.maven.surefire.booter.ForkedBooter.main(ForkedBooter.java:74) This exception looks similar to HDFS-2939 . FileJournalManager.finalizeLogSegment() failed to rename an inprogress edits file to finalize file. @Override synchronized public void finalizeLogSegment( long firstTxId, long lastTxId) throws IOException { File inprogressFile = NNStorage.getInProgressEditsFile(sd, firstTxId); File dstFile = NNStorage.getFinalizedEditsFile( sd, firstTxId, lastTxId); LOG.info( "Finalizing edits file " + inprogressFile + " -> " + dstFile); Preconditions.checkState(!dstFile.exists(), "Can't finalize edits file " + inprogressFile + " since finalized file " + "already exists" ); if (!inprogressFile.renameTo(dstFile)) { storage.reportErrorsOnDirectory(sd); throw new IllegalStateException( "Unable to finalize edits file " + inprogressFile); } if (inprogressFile.equals(currentInProgress)) { currentInProgress = null ; } } One of the probable cause of renaming failure is the inprogress file was not closed. It is closed just before finalizedLogSegment() in JournalSet.JournalClosure.apply(), but the code block is not synchronized. @Override public void finalizeLogSegment( final long firstTxId, final long lastTxId) throws IOException { mapJournalsAndReportErrors( new JournalClosure() { @Override public void apply(JournalAndStream jas) throws IOException { if (jas.isActive()) { jas.closeStream(); jas.getManager().finalizeLogSegment(firstTxId, lastTxId); } } }, "finalize log segment " + firstTxId + ", " + lastTxId); } So I guess some race condition happens in the above part. Is it ok to just enclose jas.closeStream() and jas.getManager().finalizeLogSegment() in synchronized block?
        Hide
        Sho Shimauchi added a comment -

        nvm, my guess is probably wrong.
        It's probably not the race condition in the apply() method.
        I'll look through the code carefully.

        Show
        Sho Shimauchi added a comment - nvm, my guess is probably wrong. It's probably not the race condition in the apply() method. I'll look through the code carefully.
        Hide
        Tomohiko Kinebuchi added a comment -

        I created and attached a patch for debugging.
        Could someone apply this patch on the test class and run that many times?

        Show
        Tomohiko Kinebuchi added a comment - I created and attached a patch for debugging. Could someone apply this patch on the test class and run that many times?
        Hide
        Tomohiko Kinebuchi added a comment -

        Could someone give me contributor permission and assign to me?

        Show
        Tomohiko Kinebuchi added a comment - Could someone give me contributor permission and assign to me?
        Hide
        Tomohiko Kinebuchi added a comment -

        It seems that this tests failed only once as far as I can track on the Jenkins test result history.
        I was trying to reproduce this failure, but have not been successful so far.
        Then, I am now inspecting log messages.

        Does someone know what this test case tests?

        Show
        Tomohiko Kinebuchi added a comment - It seems that this tests failed only once as far as I can track on the Jenkins test result history. I was trying to reproduce this failure, but have not been successful so far. Then, I am now inspecting log messages. Does someone know what this test case tests?
        Hide
        Tomohiko Kinebuchi added a comment -

        Attached a patch and its intent is described below.

        The first error message is "Workload exception 4 testfile /15.dat java.io.EOFException: Premature EOF: no length prefix available", and after this moment all other Workload threads failed.

        The essential problem on this test failure is unclearness of the objective of this test case and difficulties to understand what ever happened.
        At least we should clear what error case we want to capture, then we are prepared to implement a test case.

        So, I propose to delete this test case.

        Show
        Tomohiko Kinebuchi added a comment - Attached a patch and its intent is described below. The first error message is "Workload exception 4 testfile /15.dat java.io.EOFException: Premature EOF: no length prefix available", and after this moment all other Workload threads failed. The essential problem on this test failure is unclearness of the objective of this test case and difficulties to understand what ever happened. At least we should clear what error case we want to capture, then we are prepared to implement a test case. So, I propose to delete this test case.
        Hide
        Tomohiko Kinebuchi added a comment -

        a patch file for HDFS-3530

        Show
        Tomohiko Kinebuchi added a comment - a patch file for HDFS-3530
        Hide
        Todd Lipcon added a comment -

        I agree it's not a great unit test case for the reason you mentioned, but deleting it without understanding why it fails sometimes doesn't seem right. It acts as a decent stress-test/functional test, and it's still worrisome that it fails.

        Show
        Todd Lipcon added a comment - I agree it's not a great unit test case for the reason you mentioned, but deleting it without understanding why it fails sometimes doesn't seem right. It acts as a decent stress-test/functional test, and it's still worrisome that it fails.

          People

          • Assignee:
            Tomohiko Kinebuchi
            Reporter:
            Eli Collins
          • Votes:
            0 Vote for this issue
            Watchers:
            4 Start watching this issue

            Dates

            • Created:
              Updated:

              Development