Details

    • Type: Bug
    • Status: Closed
    • Priority: Blocker
    • Resolution: Fixed
    • Affects Version/s: 0.13.1, 0.14.3
    • Fix Version/s: 0.17.0
    • Component/s: None
    • Labels: None
    • Hadoop Flags: Reviewed

      Description

      [knoguchi]$ hadoop dfs -cat fileA
      07/09/13 17:36:02 INFO fs.DFSClient: Could not obtain block 0 from any node:
      java.io.IOException: No live nodes contain current block
      07/09/13 17:36:20 INFO fs.DFSClient: Could not obtain block 0 from any node:
      java.io.IOException: No live nodes contain current block
      [repeats forever]

      After changing one of the debug-level log statements to WARN, it kept showing:

       WARN org.apache.hadoop.fs.DFSClient: Failed to connect
      to /99.99.999.9 :11111:java.io.IOException: Recorded block size is 7496, but
      datanode reports size of 0
      	at org.apache.hadoop.dfs.DFSClient$DFSInputStream.blockSeekTo(DFSClient.java:690)
      	at org.apache.hadoop.dfs.DFSClient$DFSInputStream.read(DFSClient.java:771)
      	at org.apache.hadoop.fs.FSDataInputStream$PositionCache.read(FSDataInputStream.java:41)
      	at java.io.BufferedInputStream.fill(BufferedInputStream.java:218)
      	at java.io.BufferedInputStream.read1(BufferedInputStream.java:258)
      	at java.io.BufferedInputStream.read(BufferedInputStream.java:317)
      	at java.io.DataInputStream.readFully(DataInputStream.java:178)
      	at java.io.DataInputStream.readFully(DataInputStream.java:152)
      	at org.apache.hadoop.fs.ChecksumFileSystem$FSInputChecker.(ChecksumFileSystem.java:123)
      	at org.apache.hadoop.fs.ChecksumFileSystem.open(ChecksumFileSystem.java:340)
      	at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:259)
      	at org.apache.hadoop.util.CopyFiles$FSCopyFilesMapper.map(CopyFiles.java:466)
      	at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:48)
      	at org.apache.hadoop.mapred.MapTask.run(MapTask.java:186)
      	at org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:1707)
      

      Turns out fileA was corrupted. Fsck showed a crc file of 7496 bytes, but when I searched for the blocks on each datanode, all 3 replicas were size 0.

      Not sure how it got corrupted, but it would be nice if the dfs command failed instead of entering an infinite loop.

      1. 1911-0.patch
        0.8 kB
        Chris Douglas

        Issue Links

          Activity

          Koji Noguchi created issue -
          dhruba borthakur added a comment -

          chooseDataNode() has a bug that is triggered when all the replicas of a file are bad. The value of "failures" in DFSClient.chooseDataNode is always zero. When there are no more good nodes, bestNode() generates an exception that is caught inside chooseDataNode. "failures" is still zero; it clears the dead nodes, refetches the block locations, and starts all over again. Hence the infinite loop.
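          The failure mode described above can be sketched in a few lines of Java. This is a hypothetical simplification for illustration, not the actual DFSClient source; the names chooseDataNode, bestNode, and MAX_BLOCK_ACQUIRE_FAILURES mirror the comment, and the loop is capped at a caller-supplied limit only so the demo halts:

          ```java
          import java.io.IOException;
          import java.util.HashSet;
          import java.util.Set;

          // Hypothetical simplification of the retry loop described above;
          // not the actual DFSClient code.
          public class InfiniteRetrySketch {
              static final int MAX_BLOCK_ACQUIRE_FAILURES = 3;
              private final Set<String> deadNodes = new HashSet<>();

              // Stand-in for bestNode(): with every replica bad, it always throws.
              private String bestNode() throws IOException {
                  throw new IOException("No live nodes contain current block");
              }

              // Returns the number of attempts before a node is found, or -1 if
              // the loop never exits on its own (capped at `limit` so it halts).
              int chooseDataNode(int limit) throws IOException {
                  int failures = 0;           // the bug: this stays zero forever
                  int attempts = 0;
                  while (attempts < limit) {
                      attempts++;
                      try {
                          bestNode();
                          return attempts;    // found a good node
                      } catch (IOException e) {
                          if (failures >= MAX_BLOCK_ACQUIRE_FAILURES) {
                              throw e;        // intended exit, never reached
                          }
                          deadNodes.clear();  // clear dead nodes, refetch block
                                              // locations, start over -- without
                                              // ever incrementing "failures"
                      }
                  }
                  return -1;                  // loop never broke on its own
              }
          }
          ```

          With every replica bad, chooseDataNode(10_000) runs all 10,000 passes and returns -1: the failures >= MAX_BLOCK_ACQUIRE_FAILURES guard can never fire because nothing increments the counter.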

          Koji Noguchi added a comment -

          This bug still happens after 0.14 upgrade.

          If this file is part of distcp, the job won't finish.

          Koji Noguchi made changes -
          Field Original Value New Value
          Fix Version/s 0.16.0 [ 12312740 ]
          Affects Version/s 0.14.3 [ 12312830 ]
          Nigel Daley made changes -
          Fix Version/s 0.16.0 [ 12312740 ]
          Koji Noguchi made changes -
          Link This issue is related to HADOOP-83 [ HADOOP-83 ]
          Koji Noguchi added a comment -

          I just hit this on hadoop 0.15.3...

          Robert Chansler made changes -
          Priority Major [ 3 ] Blocker [ 1 ]
          Robert Chansler made changes -
          Fix Version/s 0.17.0 [ 12312913 ]
          Robert Chansler made changes -
          Assignee Chris Douglas [ chris.douglas ]
          Chris Douglas added a comment -

          There is a very simple "fix" for this: make the "failures" count an instance variable on DFSInputStream rather than a local variable in chooseDataNode. This would change the semantics of MAX_BLOCK_ACQUIRE_FAILURES to a cap on the total number of block acquisition failures over the life of the stream, which is not exactly correct, but it is a fix we could easily get into 0.17. It will yield false negatives for a particularly problematic stream, but for applications like distcp it should be sufficient.

          After consulting with Dhruba, the longer-term fix will track failures not with a single list of "deadnodes", but with a map of blocks to lists of dead nodes and, to preserve the retry semantics, a map of blocks to full acquisition failures. Right now, a datanode that fails to serve a block is blacklisted on the stream until there are no replicas available for some block, at which point the list is cleared. The false negatives this yields require the existing, problematic retry semantics. After confirming this approach with Koji, I'll file a JIRA for the more correct fix and submit the sufficient one for 0.17.
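          The stop-gap fix can be sketched as follows. This is a hypothetical illustration, not the committed 1911-0.patch: hoisting "failures" from a local variable to an instance field means the counter survives the clear-and-refetch pass, so the client eventually throws instead of looping forever:

          ```java
          import java.io.IOException;
          import java.util.HashSet;
          import java.util.Set;

          // Hypothetical sketch of the stop-gap fix; not the committed patch.
          public class CappedRetrySketch {
              static final int MAX_BLOCK_ACQUIRE_FAILURES = 3;
              private final Set<String> deadNodes = new HashSet<>();
              // The fix: "failures" is now an instance field, capping total block
              // acquisition failures over the life of the stream.
              private int failures = 0;

              // Stand-in for bestNode(): with every replica bad, it always throws.
              private String bestNode() throws IOException {
                  throw new IOException("No live nodes contain current block");
              }

              String chooseDataNode() throws IOException {
                  while (true) {
                      try {
                          return bestNode();
                      } catch (IOException e) {
                          if (failures >= MAX_BLOCK_ACQUIRE_FAILURES) {
                              // Give up instead of retrying forever.
                              throw new IOException("Could not obtain block", e);
                          }
                          failures++;        // now actually counted across retries
                          deadNodes.clear(); // clear dead nodes, refetch, retry
                      }
                  }
              }
          }
          ```

          With all replicas bad, chooseDataNode() now throws after MAX_BLOCK_ACQUIRE_FAILURES retries. As noted above, the cap applies per stream rather than per block, so a long-lived stream that accumulates a few transient failures could give up early; that is the false-negative trade-off accepted for 0.17.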

          Chris Douglas made changes -
          Attachment 1911-0.patch [ 12379436 ]
          Chris Douglas made changes -
          Status Open [ 1 ] Patch Available [ 10002 ]
          Chris Douglas added a comment -

          I ran the Hudson validation on my dev box and, excluding the absence of unit tests, everything passes. If someone can +1 this, I can commit it and file a separate JIRA for a unit test.

          dhruba borthakur added a comment -

          This code change looks good.

          Chris Douglas added a comment -

          I just committed this.

          Chris Douglas made changes -
          Hadoop Flags [Reviewed]
          Resolution Fixed [ 1 ]
          Status Patch Available [ 10002 ] Resolved [ 5 ]
          Chris Douglas made changes -
          Link This issue is related to HADOOP-3185 [ HADOOP-3185 ]
          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12379436/1911-0.patch
          against trunk revision 643282.

          @author +1. The patch does not contain any @author tags.

          tests included -1. The patch doesn't appear to include any new or modified tests.
          Please justify why no tests are needed for this patch.

          javadoc +1. The javadoc tool did not generate any warning messages.

          javac +1. The applied patch does not generate any new javac compiler warnings.

          release audit +1. The applied patch does not generate any new release audit warnings.

          findbugs +1. The patch does not introduce any new Findbugs warnings.

          core tests -1. The patch failed core unit tests.

          contrib tests +1. The patch passed contrib unit tests.

          Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2170/testReport/
          Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2170/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
          Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2170/artifact/trunk/build/test/checkstyle-errors.html
          Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2170/console

          This message is automatically generated.

          Chris Douglas added a comment -

          The failing test is related to HADOOP-3139 (see Nicholas's last comment). The count of threads tracking the creation of lease threads has a race that showed up only in our tests for Java 1.5 on Mac and Linux on 0.16, but not in Java 1.6 or in trunk. Hudson might be hitting the same condition with trunk. We might consider putting the sleep in the test into trunk, as we did for 0.16.

          Hudson added a comment -

          Integrated in Hadoop-trunk #451 (See http://hudson.zones.apache.org/hudson/job/Hadoop-trunk/451/ )
          Nigel Daley made changes -
          Status Resolved [ 5 ] Closed [ 6 ]
          Igor Bolotin made changes -
          Link This issue is related to HADOOP-4681 [ HADOOP-4681 ]
          Owen O'Malley made changes -
          Component/s dfs [ 12310710 ]
          Transition                    Time In Source Status  Execution Times  Last Executer  Last Execution Date
          Open -> Patch Available       199d 23h 46m           1                Chris Douglas  04/Apr/08 22:56
          Patch Available -> Resolved   1h 35m                 1                Chris Douglas  05/Apr/08 00:31
          Resolved -> Closed            46d 20h 33m            1                Nigel Daley    21/May/08 21:05

            People

            • Assignee: Chris Douglas
            • Reporter: Koji Noguchi
            • Votes: 0
            • Watchers: 0
