Details

    • Type: Bug
    • Status: Closed
    • Priority: Blocker
    • Resolution: Fixed
    • Affects Version/s: 0.17.0
    • Fix Version/s: 0.17.2
    • Component/s: None
    • Labels: None

      Description

      We had a dfsClient -put hang, outputting:

      2008-06-28 10:05:12,595 WARN org.apache.hadoop.dfs.DFSClient: DataStreamer Exception: java.net.SocketTimeoutException:
      timed out waiting for rpc response
      2008-06-28 10:05:12,595 WARN org.apache.hadoop.dfs.DFSClient: Error Recovery for block null bad datanode[0]
      2008-06-28 10:05:51,067 INFO org.apache.hadoop.dfs.DFSClient: Could not complete file
      /_temporary/_task_200806262325_4136_r_000408_0/part-00408
      retrying...
      2008-06-28 10:05:52,898 INFO org.apache.hadoop.dfs.DFSClient: Could not complete file
      /_temporary/_task_200806262325_4136_r_000408_0/part-00408
      retrying...
      2008-06-28 10:05:54,893 INFO org.apache.hadoop.dfs.DFSClient: Could not complete file
      /_temporary/_task_200806262325_4136_r_000408_0/part-00408
      retrying...
      2008-06-28 10:05:56,920 INFO org.apache.hadoop.dfs.DFSClient: Could not complete file
      /_temporary/_task_200806262325_4136_r_000408_0/part-00408
      retrying...
      2008-06-28 10:05:57,765 INFO org.apache.hadoop.dfs.DFSClient: Could not complete file
      /_temporary/_task_200806262325_4136_r_000408_0/part-00408
      retrying...
      2008-06-28 10:05:58,199 INFO org.apache.hadoop.dfs.DFSClient: Could not complete file
      /_temporary/_task_200806262325_4136_r_000408_0/part-00408
      retrying...
      [repeats forever]

      Attachments
      
      1. H-3681-jstack.txt
        7 kB
        Koji Noguchi
      2. HADOOP-3681-1.patch
        0.5 kB
        Lohit Vijayarenu
      3. HADOOP-3681-2.patch
        1 kB
        Lohit Vijayarenu
      4. HADOOP-3681-3-17.patch
        2 kB
        Lohit Vijayarenu
      5. HADOOP-3681-3-18.patch
        2 kB
        Lohit Vijayarenu
      6. HADOOP-3681-3-19.patch
        2 kB
        Lohit Vijayarenu
      7. HADOOP-3681-3-19.patch
        2 kB
        Lohit Vijayarenu

        Activity

        Hudson added a comment - Integrated in Hadoop-trunk #581 (See http://hudson.zones.apache.org/hudson/job/Hadoop-trunk/581/ )

        Raghu Angadi added a comment -

        I just committed this. Thanks Lohit!

        Lohit Vijayarenu added a comment -

        Looks like the patch build picked up the latest file even though it was not marked for inclusion. It tried to apply the 0.17 patch on trunk. Reattaching the 0.19 patch and retrying Hudson.

        Hadoop QA added a comment -

        -1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12385262/HADOOP-3681-3-17.patch
        against trunk revision 674834.

        +1 @author. The patch does not contain any @author tags.

        -1 tests included. The patch doesn't appear to include any new or modified tests.
        Please justify why no tests are needed for this patch.

        -1 patch. The patch command could not apply the patch.

        Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2812/console

        This message is automatically generated.

        Lohit Vijayarenu added a comment -

        I ran the tests against trunk and branch-0.18 on my local machine. Both runs passed all tests.

        Robert Chansler added a comment -

        Should have been promoted earlier.

        dhruba borthakur added a comment -

        The third patch looks good. +1.

        Koji Noguchi added a comment -

        +1 Tested manually (throwing IOException in the middle).

        With second patch

        bash-3.00$ ls -l testfile
        -rw-r--r--  1 knoguchi users 75396 Jul  4 01:46 testfile
        bash-3.00$ $HADOOP_HOME/bin/hadoop dfs -put testfile  /user/knoguchi
        08/07/04 02:13:43 WARN dfs.DFSClient: DataStreamer Exception: java.io.IOException: testing
        08/07/04 02:13:43 WARN dfs.DFSClient: Error Recovery for block null bad datanode[0]
        bash-3.00$ echo $?
        0
        bash-3.00$ $HADOOP_HOME/bin/hadoop dfs -ls /user/knoguchi
        Found 1 items
        /user/knoguchi/testfile <r 1>   0       2008-07-04 02:13        rw-r--r--       knoguchi        supergroup
        

        With third patch,

        bash-3.00$ $HADOOP_HOME/bin/hadoop dfs -put testfile  /user/knoguchi
        08/07/04 02:15:03 WARN dfs.DFSClient: DataStreamer Exception: java.io.IOException: testing
        08/07/04 02:15:03 WARN dfs.DFSClient: Error Recovery for block null bad datanode[0]
        put: Could not get block locations. Aborting...
        Exception closing file /user/knoguchi/testfile
        java.io.IOException: Could not get block locations. Aborting...
                at org.apache.hadoop.dfs.DFSClient$DFSOutputStream.processDatanodeError(DFSClient.java:2084)
                at org.apache.hadoop.dfs.DFSClient$DFSOutputStream.access$1300(DFSClient.java:1702)
                at org.apache.hadoop.dfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:1822)
        bash-3.00$ echo $?
        255
        
        Lohit Vijayarenu added a comment -

        Same patch for 0.17

        Lohit Vijayarenu added a comment -

        Same patch for 0.19 version.

        Lohit Vijayarenu added a comment -

        Yes, there could be an exception thrown before locateFollowingBlock. So, we have to check lastException before setting close to true by calling isClosed(). As you said after flush is the right place. This updated patch also changes one testcase which used to dump the exception on stdout.

        Koji Noguchi added a comment -

        Lohit, with your second patch, if the DataStreamer thread throws an exception before locateFollowingBlock(startTime), hadoop dfs -put can incorrectly succeed, ending up with an empty dfs file.

        I don't know the details of dfs well enough, but maybe we need to check for the error after the queue is emptied?
        Can we call isClosed() at the bottom of flushInternal?
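
        One way to read that suggestion, as a hypothetical sketch (the queue, class, and method names here are illustrative stand-ins, not the actual DFSOutputStream internals):

```java
import java.io.IOException;
import java.util.ArrayDeque;
import java.util.Queue;

// Illustrative sketch: check for a recorded streamer error after the data
// queue has drained, so a streamer that died before locateFollowingBlock
// turns into an exception on close() instead of a silent empty file.
public class FlushCheckSketch {
    static volatile IOException lastException;
    static final Queue<byte[]> dataQueue = new ArrayDeque<>();

    // Mimics isClosed(): rethrow any error the streamer thread recorded.
    static void checkClosed() throws IOException {
        if (lastException != null) {
            throw lastException;
        }
    }

    // Mimics flushInternal(): drain the queue, then verify the streamer
    // did not fail along the way (the check suggested at the bottom).
    static void flushInternal() throws IOException {
        while (!dataQueue.isEmpty()) {
            dataQueue.poll();         // pretend the streamer consumed it
        }
        checkClosed();
    }

    public static void main(String[] args) {
        dataQueue.add(new byte[]{1, 2, 3});
        lastException = new IOException("streamer died before allocating a block");
        try {
            flushInternal();
        } catch (IOException e) {
            System.out.println("flush failed: " + e.getMessage());
        }
    }
}
```

        With the check at the bottom, the error surfaces to the caller of flushInternal instead of being lost when the streamer thread dies.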

        Lohit Vijayarenu added a comment -

        Looks like Hudson is not running core tests. I see the same failure on another patch as well.

        Hadoop QA added a comment -

        -1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12385226/HADOOP-3681-2.patch
        against trunk revision 673517.

        +1 @author. The patch does not contain any @author tags.

        -1 tests included. The patch doesn't appear to include any new or modified tests.
        Please justify why no tests are needed for this patch.

        +1 javadoc. The javadoc tool did not generate any warning messages.

        +1 javac. The applied patch does not increase the total number of javac compiler warnings.

        +1 findbugs. The patch does not introduce any new Findbugs warnings.

        +1 release audit. The applied patch does not increase the total number of release audit warnings.

        -1 core tests. The patch failed core unit tests.

        -1 contrib tests. The patch failed contrib unit tests.

        Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2793/testReport/
        Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2793/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
        Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2793/artifact/trunk/build/test/checkstyle-errors.html
        Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2793/console

        This message is automatically generated.

        Robert Chansler added a comment -

        Need patches for both 17 and 18, if different.

        Lohit Vijayarenu added a comment -

        Thanks Koji, Dhruba.
        Another try with this new patch. Here, we catch Throwable, and isComplete() is called only after we have tried to complete the file 10 times. I think we should allow completeFile to retry a few times before checking isClosed().

        Koji Noguchi added a comment -

        Thanks Lohit!

        In the DataStreamer thread, can we catch Throwable instead of IOException?
        If OutOfMemoryError is thrown, it'll still hang.
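
        The distinction can be sketched as follows (illustrative names, not the real DataStreamer; the OutOfMemoryError is thrown deliberately rather than arising from real memory pressure):

```java
// Illustrative sketch: a streamer loop that catches only IOException lets an
// Error (e.g. OutOfMemoryError) kill the thread without recording anything,
// so the closing thread waits forever. Catching Throwable records the
// failure so close() can rethrow it. Names here are hypothetical.
public class StreamerCatchSketch {
    static volatile Throwable lastException;

    // Run one streamer body, recording any failure at all.
    static void runStreamer(Runnable body) {
        try {
            body.run();
        } catch (Throwable t) {       // Throwable, not just IOException
            lastException = t;        // recorded; close() can now fail fast
        }
    }

    public static void main(String[] args) {
        // Simulated OOM (thrown explicitly, not a real allocation failure).
        runStreamer(() -> { throw new OutOfMemoryError("simulated"); });
        System.out.println("recorded: " + lastException);
    }
}
```

        Had the catch been `catch (IOException e)`, the simulated OutOfMemoryError would have propagated out of run(), leaving lastException unset and the closing thread with nothing to rethrow.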

        dhruba borthakur added a comment -

        Hi Lohit, +1 on your patch.

        Lohit Vijayarenu added a comment -

        Thanks Koji. I was also able to reproduce this by throwing an exception after locateFollowingBlock. Looks like this is what happened:

        • DFSClient timed out getting a new block from the namenode while the namenode was busy. But in this case, the namenode did allocate a block on behalf of the client.
        • This raised an exception, and locateFollowingBlock eventually returned the exception, closing the streamer.
        • closeInternal then went past isClosed() and tried to complete the file.
        • The namenode had a connection to the client, and so did not expire the lease.

        Suggested fix is to call isClosed() while trying to complete the file. I tested this manually; it throws the exception stored in lastException and terminates the client.
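
        A minimal sketch of the retry-loop behavior described above, with hypothetical names (the real code lives in DFSClient$DFSOutputStream; `checkClosed`, `namenodeComplete`, and the class itself are illustrative stand-ins):

```java
import java.io.IOException;

// Illustrative sketch only, not the actual DFSClient code: the close path
// consults the exception recorded by the streamer thread on every retry,
// so a dead streamer aborts the loop instead of retrying forever.
public class CompleteFileRetrySketch {
    // Set by the (failed) DataStreamer-equivalent thread.
    static volatile IOException lastException;

    // Mimics isClosed(): rethrow any error the streamer recorded.
    static void checkClosed() throws IOException {
        if (lastException != null) {
            throw lastException;
        }
    }

    // Stand-in for namenode.complete(); never succeeds here, as in the hang.
    static boolean namenodeComplete() {
        return false;
    }

    // Mimics the completeFile loop; without the checkClosed() call inside
    // the loop, this would retry forever once the streamer is dead.
    static void completeFile(int maxRetries) throws IOException {
        for (int i = 0; i < maxRetries; i++) {
            checkClosed();            // the fix: surface streamer errors
            if (namenodeComplete()) {
                return;
            }
            // backoff/sleep between retries omitted for brevity
        }
        throw new IOException("Could not complete file after retries");
    }

    public static void main(String[] args) {
        lastException = new IOException("Could not get block locations. Aborting...");
        try {
            completeFile(10);
        } catch (IOException e) {
            System.out.println("aborted: " + e.getMessage());
        }
    }
}
```

        This matches the observed behavior of the third patch in Koji's manual test below: the stored exception surfaces and the client exits nonzero instead of looping on "Could not complete file ... retrying...".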

        Koji Noguchi added a comment -

        Trying to reproduce.

        1) Intentionally fail DataStreamer by throwing IOException right AFTER

        2219         lb = locateFollowingBlock(startTime);
        

        2) Add Thread.sleep(1000) at the top of DataStreamer thread run()
        so that DataStreamer would fail after flushInternal() line

        2524           isClosed();
        

        This will reproduce the hang state.

        Also, if the DataStreamer throws the IOException BEFORE that line, dfs -put returns '0' but ends up with an empty file.

        Tsz Wo Nicholas Sze added a comment -

        Since the length of blk_4878955496501003583 is 0, could we let Fsck delete the block? Then, namenode.complete(...) will return normally.

        Although this does not fix the bug, it provides an option for system recovery.

        Koji Noguchi added a comment -

        Attaching jstack of the dfsclient.

        Koji Noguchi added a comment -

        fsck -files -locations -blocks /_temporary/_task_200806262325_4136_r_000408_0/part-00408

        showing

        /_temporary/_task_200806262325_4136_r_000408_0/part-00408
        0 bytes, 1 block(s):  Replica placement policy is violated for blk_4878955496501003583. Block should be additionally
        replicated on 2 more rack(s).
         MISSING 1 blocks of total size 0 B
        0. blk_4878955496501003583 len=0 MISSING!
        
        Status: CORRUPT
        
        Koji Noguchi added a comment -

        Stack trace of the close.
        The DataStreamer thread doesn't show up in the jstack output.

        "main" prio=10 tid=0x0805a800 nid=0x17a1 waiting on condition [0xf7e6c000..0xf7e6d1f8]
           java.lang.Thread.State: TIMED_WAITING (sleeping)
                at java.lang.Thread.sleep(Native Method)
                at org.apache.hadoop.dfs.DFSClient$DFSOutputStream.closeInternal(DFSClient.java:2658)
                - locked <0xd524fc08> (a org.apache.hadoop.dfs.DFSClient$DFSOutputStream)
                at org.apache.hadoop.dfs.DFSClient$DFSOutputStream.close(DFSClient.java:2576)
                at org.apache.hadoop.fs.FSDataOutputStream$PositionCache.close(FSDataOutputStream.java:59)
                at org.apache.hadoop.fs.FSDataOutputStream.close(FSDataOutputStream.java:79)
                at org.apache.hadoop.mapred.TextOutputFormat$LineRecordWriter.close(TextOutputFormat.java:94)
                - locked <0xd524fce0> (a org.apache.hadoop.mapred.TextOutputFormat$LineRecordWriter)
                at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:398)
                at org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:2124)
        

          People

          • Assignee: Lohit Vijayarenu
          • Reporter: Koji Noguchi
          • Votes: 0
          • Watchers: 1
