Details

    • Type: Bug
    • Status: Closed
    • Priority: Blocker
    • Resolution: Fixed
    • Affects Version/s: 0.13.1, 0.14.3
    • Fix Version/s: 0.17.0
    • Component/s: None
    • Labels: None
    • Hadoop Flags: Reviewed

      Description

      [knoguchi]$ hadoop dfs -cat fileA
      07/09/13 17:36:02 INFO fs.DFSClient: Could not obtain block 0 from any node:
      java.io.IOException: No live nodes contain current block
      07/09/13 17:36:20 INFO fs.DFSClient: Could not obtain block 0 from any node:
      java.io.IOException: No live nodes contain current block
      [repeats forever]

      Setting one of the debug log statements to WARN, it kept on showing:

       
       WARN org.apache.hadoop.fs.DFSClient: Failed to connect
      to /99.99.999.9:11111:java.io.IOException: Recorded block size is 7496, but
      datanode reports size of 0
      	at org.apache.hadoop.dfs.DFSClient$DFSInputStream.blockSeekTo(DFSClient.java:690)
      	at org.apache.hadoop.dfs.DFSClient$DFSInputStream.read(DFSClient.java:771)
      	at org.apache.hadoop.fs.FSDataInputStream$PositionCache.read(FSDataInputStream.java:41)
      	at java.io.BufferedInputStream.fill(BufferedInputStream.java:218)
      	at java.io.BufferedInputStream.read1(BufferedInputStream.java:258)
      	at java.io.BufferedInputStream.read(BufferedInputStream.java:317)
      	at java.io.DataInputStream.readFully(DataInputStream.java:178)
      	at java.io.DataInputStream.readFully(DataInputStream.java:152)
      	at org.apache.hadoop.fs.ChecksumFileSystem$FSInputChecker.<init>(ChecksumFileSystem.java:123)
      	at org.apache.hadoop.fs.ChecksumFileSystem.open(ChecksumFileSystem.java:340)
      	at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:259)
      	at org.apache.hadoop.util.CopyFiles$FSCopyFilesMapper.map(CopyFiles.java:466)
      	at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:48)
      	at org.apache.hadoop.mapred.MapTask.run(MapTask.java:186)
      	at org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:1707)
      

      Turns out fileA was corrupted. Fsck showed a crc file of 7496 bytes, but when I searched for the blocks on each node, all 3 replicas were size 0.

      Not sure how it got corrupted, but it would be nice if the dfs command failed instead of getting into an infinite loop.

      Attachments

      1. 1911-0.patch (0.8 kB) - Chris Douglas

        Issue Links

          Activity

          Hudson added a comment -

          Integrated in Hadoop-trunk #451 (See http://hudson.zones.apache.org/hudson/job/Hadoop-trunk/451/ )
          Chris Douglas added a comment -

          The test that failed is related to HADOOP-3139 (see Nicholas's last comment). The count of threads tracking the creation of lease threads has a race that only showed up in our tests for Java 1.5 on Mac and Linux on 0.16, but not in Java 1.6 or in trunk. Hudson might be hitting the same condition with trunk. We might consider putting the sleep in the test into trunk, as we did for 0.16.

          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12379436/1911-0.patch
          against trunk revision 643282.

          @author +1. The patch does not contain any @author tags.

          tests included -1. The patch doesn't appear to include any new or modified tests.
          Please justify why no tests are needed for this patch.

          javadoc +1. The javadoc tool did not generate any warning messages.

          javac +1. The applied patch does not generate any new javac compiler warnings.

          release audit +1. The applied patch does not generate any new release audit warnings.

          findbugs +1. The patch does not introduce any new Findbugs warnings.

          core tests -1. The patch failed core unit tests.

          contrib tests +1. The patch passed contrib unit tests.

          Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2170/testReport/
          Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2170/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
          Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2170/artifact/trunk/build/test/checkstyle-errors.html
          Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2170/console

          This message is automatically generated.

          Chris Douglas added a comment -

          I just committed this.

          dhruba borthakur added a comment -

          This code change looks good.

          Chris Douglas added a comment -

          I ran the Hudson validation on my dev box and, excluding the absence of unit tests, everything passes. If someone can +1 this then I can commit it and file a separate JIRA for a unit test.

          Chris Douglas added a comment -

          There is a very simple "fix" to this: make the "failures" count an instance variable on DFSInputStream rather than a local variable in chooseDataNode. This would change the semantics of MAX_BLOCK_ACQUIRE_FAILURES to a cap on the total number of block acquisition failures for the life of the stream, which is not exactly correct, but it is a fix we could easily get into 0.17. It will yield false negatives for a particularly problematic stream, but for applications like distcp it should be sufficient.

          After consulting with Dhruba, the longer-term fix will track failures not with a single list of "deadnodes", but with a map of blocks to lists of dead nodes and, to preserve the retry semantics, a map of blocks to full acquisition failures. Right now, a datanode that fails to serve a block is blacklisted on the stream until there are no replicas available for some block, at which point the list is cleared. The false negatives this yields require the existing, problematic retry semantics. After confirming this approach with Koji, I'll file a JIRA for the more correct fix and submit the sufficient one for 0.17.

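          As a hedged sketch only (not the text of 1911-0.patch): the change described above amounts to promoting the failure count from a local in chooseDataNode() to a field on the input stream, so MAX_BLOCK_ACQUIRE_FAILURES caps total block-acquisition failures over the life of the stream. The class and helper names below are stand-ins; only DFSInputStream, chooseDataNode and MAX_BLOCK_ACQUIRE_FAILURES come from the comment.

          import java.io.IOException;

          public class StreamFailureCapSketch {
            static final int MAX_BLOCK_ACQUIRE_FAILURES = 3;

            // Previously a local inside chooseDataNode(), reset to zero on every call;
            // as a field, the count survives across retries on this stream.
            private int failures = 0;

            // Stand-in for picking a datanode; throws when no live replica remains.
            private String bestNode() throws IOException {
              throw new IOException("No live nodes contain current block");
            }

            String chooseDataNode() throws IOException {
              while (true) {
                try {
                  return bestNode();
                } catch (IOException ie) {
                  if (failures >= MAX_BLOCK_ACQUIRE_FAILURES) {
                    throw ie;    // now reachable, so a block with all replicas bad eventually fails
                  }
                  failures++;    // counted for the whole stream, not per chooseDataNode() call
                  // clear the dead-node list, refetch block locations, then retry
                }
              }
            }
          }

          The trade-off noted above falls out of this shape: a long-lived stream that hits transient failures on several different blocks can exhaust the cap even though no single block is unreadable, which is the false negative the comment accepts for 0.17.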
          Koji Noguchi added a comment -

          I just hit this on Hadoop 0.15.3...

          Koji Noguchi added a comment -

          This bug still happens after the 0.14 upgrade.

          If this file is part of a distcp, the job won't finish.

          dhruba borthakur added a comment -

          chooseDataNode() has a bug that is triggered when all the replicas of a file are bad. The value of "failures" in DFSClient.chooseDataNode is always zero. When there are no more good nodes, bestNode() generates an exception that is caught inside chooseDataNode. "failures" is still zero; it clears the deadnodes, refetches block locations and starts all over again. Hence the infinite loop.

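          As an illustration of the loop described above, here is a minimal, self-contained sketch; the types and helper bodies are stand-ins rather than the actual DFSClient source, and only chooseDataNode, bestNode, the dead-node list and MAX_BLOCK_ACQUIRE_FAILURES are taken from the comment.

          import java.io.IOException;
          import java.util.ArrayList;
          import java.util.List;

          public class ChooseDataNodeLoopSketch {
            static final int MAX_BLOCK_ACQUIRE_FAILURES = 3;
            private final List<String> deadNodes = new ArrayList<String>();

            // Stand-in for bestNode(): with every replica bad it always throws.
            private String bestNode(List<String> replicas) throws IOException {
              throw new IOException("No live nodes contain current block");
            }

            // Buggy shape: "failures" is a local variable and is still zero when checked,
            // so the MAX_BLOCK_ACQUIRE_FAILURES guard never trips.
            String chooseDataNode(List<String> replicas) throws IOException {
              int failures = 0;
              while (true) {
                try {
                  return bestNode(replicas);
                } catch (IOException ie) {
                  if (failures >= MAX_BLOCK_ACQUIRE_FAILURES) {
                    throw ie;        // unreachable: failures is never incremented
                  }
                  deadNodes.clear(); // forget the dead datanodes...
                  // ...refetch block locations from the namenode and retry forever, which
                  // is what produces the repeating "Could not obtain block" INFO messages.
                }
              }
            }
          }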

            People

            • Assignee:
              Chris Douglas
            • Reporter:
              Koji Noguchi
            • Votes: 0
            • Watchers: 0

              Dates

              • Created:
              • Updated:
              • Resolved:
