Hadoop HDFS / HDFS-3398

Client will not retry when primaryDN is down once it's just got pipeline

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: 2.0.0-alpha
    • Fix Version/s: 3.0.0, 2.0.2-alpha
    • Component/s: hdfs-client
    • Labels:
      None

      Description

      Scenario:
      =========
      Start the NN and three DNs.

      Get the datanodes to which the block has to be replicated, from

      nodes = nextBlockOutputStream(src);
      
      

      Before starting to write to the DN, kill the primary DN.

      // write out data to remote datanode
      blockStream.write(buf.array(), buf.position(), buf.remaining());
      blockStream.flush();
      

      Now the write will fail with the exception:

      2012-05-10 14:21:47,993 WARN  hdfs.DFSClient (DFSOutputStream.java:run(552)) - DataStreamer Exception
      java.io.IOException: An established connection was aborted by the software in your host machine
      	at sun.nio.ch.SocketDispatcher.write0(Native Method)
      	at sun.nio.ch.SocketDispatcher.write(Unknown Source)
      	at sun.nio.ch.IOUtil.writeFromNativeBuffer(Unknown Source)
      	at sun.nio.ch.IOUtil.write(Unknown Source)
      	at sun.nio.ch.SocketChannelImpl.write(Unknown Source)
      	at org.apache.hadoop.net.SocketOutputStream$Writer.performIO(SocketOutputStream.java:60)
      	at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:142)
      	at org.apache.hadoop.net.SocketOutputStream.write(SocketOutputStream.java:151)
      	at org.apache.hadoop.net.SocketOutputStream.write(SocketOutputStream.java:112)
      	at java.io.BufferedOutputStream.write(Unknown Source)
      	at java.io.DataOutputStream.write(Unknown Source)
      	at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:513)
      
      


      1. HDFS_3398_3.patch
        1 kB
        amith
      2. HDFS-3398.patch
        0.6 kB
        amith
      3. HDFS-3398.patch
        0.6 kB
        amith

        Activity

        Todd Lipcon added a comment -

        Does this affect branch-1 as well?

        P.S. Please set "Target version" instead of "Fix version" for unfixed bugs.

        Uma Maheswara Rao G added a comment -

        Seems to be a good catch, Brahma.

        @Todd, it looks like a problem to me. When writing to the socket, if the other peer goes down, the client may treat that as a client error and exit.
        How about catching exceptions from the socket operations and setting errorIndex to 1 (treating the first node as bad)?

        I did not see the check below in the 205 code.

          if (errorIndex == -1) { // not a datanode error
            streamerClosed = true;
          }
        	  

        205 code on throwable:

          } catch (Throwable e) {
                      LOG.warn("DataStreamer Exception: " + 
                               StringUtils.stringifyException(e));
                      if (e instanceof IOException) {
                        setLastException((IOException)e);
                      }
                      hasError = true;
                    }
                  }
         

        In trunk:

          } catch (Throwable e) {
                  DFSClient.LOG.warn("DataStreamer Exception", e);
                  if (e instanceof IOException) {
                    setLastException((IOException)e);
                  }
                  hasError = true;
                  if (errorIndex == -1) { // not a datanode error
                    streamerClosed = true;
                  }
                }
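
The effect of the extra check can be illustrated with a small stand-alone model. This is only a sketch under stated assumptions: the class `StreamerState`, the method `onThrowable`, and the `closeWhenNoBadNode` flag are hypothetical names invented for illustration and are not part of DFSOutputStream; only the fields `hasError`, `errorIndex`, and `streamerClosed` are taken from the snippets above.

```java
import java.io.IOException;

/** Toy model of the DataStreamer error flags; NOT the real DFSOutputStream code. */
final class StreamerState {
  boolean hasError = false;
  int errorIndex = -1;            // -1 means no datanode has been identified as bad
  boolean streamerClosed = false;
  IOException lastException;

  /**
   * Mirrors the catch blocks quoted above. closeWhenNoBadNode=true models the
   * trunk snippet, false models the 205 snippet (hypothetical switch).
   */
  void onThrowable(Throwable e, boolean closeWhenNoBadNode) {
    if (e instanceof IOException) {
      lastException = (IOException) e;
    }
    hasError = true;
    if (closeWhenNoBadNode && errorIndex == -1) { // not a datanode error
      streamerClosed = true;
    }
  }

  public static void main(String[] args) {
    StreamerState trunk = new StreamerState();
    trunk.onThrowable(new IOException("connection aborted"), true);
    System.out.println("trunk: streamerClosed=" + trunk.streamerClosed);

    StreamerState v205 = new StreamerState();
    v205.onThrowable(new IOException("connection aborted"), false);
    System.out.println("205: streamerClosed=" + v205.streamerClosed);
  }
}
```

In this model, a write failure that arrives while `errorIndex` is still -1 closes the streamer in the trunk variant but only sets `hasError` in the 205 variant, which matches the behaviour difference described in the comment.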
        
        amith added a comment -

        I am submitting a patch that removes the

        if (errorIndex == -1)

        check, since we have ResponseProcessor to correctly identify the failed DN in the pipeline and to take suitable action.

        Hadoop QA added a comment -

        -1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12528208/HDFS-3398.patch
        against trunk revision .

        +1 @author. The patch does not contain any @author tags.

        -1 tests included. The patch doesn't appear to include any new or modified tests.
        Please justify why no new tests are needed for this patch.
        Also please list what manual steps were performed to verify this patch.

        +1 javac. The applied patch does not increase the total number of javac compiler warnings.

        -1 javadoc. The javadoc tool appears to have generated 2 warning messages.

        +1 eclipse:eclipse. The patch built with eclipse:eclipse.

        +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

        +1 release audit. The applied patch does not increase the total number of release audit warnings.

        -1 core tests. The patch failed these unit tests in hadoop-hdfs-project/hadoop-hdfs:

        org.apache.hadoop.hdfs.server.namenode.ha.TestPipelinesFailover

        +1 contrib tests. The patch passed contrib unit tests.

        Test results: https://builds.apache.org/job/PreCommit-HDFS-Build/2487//testReport/
        Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/2487//console

        This message is automatically generated.

        amith added a comment -

        Attaching the patch

        Hadoop QA added a comment -

        -1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12529871/HDFS_3398_3.patch
        against trunk revision .

        +1 @author. The patch does not contain any @author tags.

        -1 tests included. The patch doesn't appear to include any new or modified tests.
        Please justify why no new tests are needed for this patch.
        Also please list what manual steps were performed to verify this patch.

        +1 javac. The applied patch does not increase the total number of javac compiler warnings.

        +1 javadoc. The javadoc tool did not generate any warning messages.

        +1 eclipse:eclipse. The patch built with eclipse:eclipse.

        +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

        +1 release audit. The applied patch does not increase the total number of release audit warnings.

        +1 core tests. The patch passed unit tests in hadoop-hdfs-project/hadoop-hdfs.

        +1 contrib tests. The patch passed contrib unit tests.

        Test results: https://builds.apache.org/job/PreCommit-HDFS-Build/2525//testReport/
        Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/2525//console

        This message is automatically generated.

        amith added a comment -

        I am not able to write a test, since it requires the primary DN to be down before blockStream.write(...) is called.

        Uma Maheswara Rao G added a comment -

        The change looks good to me. I agree, adding a test for this change would be a little difficult. Have you tested this manually with debug points?

        amith added a comment -

        I manually tested the patch using breakpoints in debug mode.
        Steps:
        1. Put breakpoints at nodes=nextBlockOutputStream() and at blockStream.write(...).
        2. Identify the primary DN selected by nodes=nextBlockOutputStream(), and when control reaches the point just before blockStream.write(...), kill the primary DN.
        3. Now blockStream, which points to the primary DN, will not be able to send data, so an IOException will be thrown.

        Result without the patch:
        In the catch block hasError is set but errorIndex is not, so the failure is treated as a client error rather than a DN error, and the client stops.

        Result with the patch:
        We handle the IOException from blockStream, set errorIndex to the primary DN, and rethrow the exception. Now both errorIndex=0 and hasError=true are set, so this is treated as a DN failure rather than a client failure, and the client will try to update its pipeline and continue writing.
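
The behaviour described in these results can be sketched as a small stand-alone model. It is a sketch under stated assumptions, not the committed patch: the class `PatchedStreamerSketch` and the methods `sendPacket` and `shouldRecoverPipeline` are hypothetical names, and a stub `OutputStream` that always throws stands in for the connection to the killed primary DN. Only `errorIndex`, `hasError`, `streamerClosed`, and `blockStream` come from the discussion above.

```java
import java.io.IOException;
import java.io.OutputStream;

/** Toy model of the patched write path; NOT the real DFSOutputStream code. */
final class PatchedStreamerSketch {
  boolean hasError = false;
  int errorIndex = -1;            // -1 means no datanode has been identified as bad
  boolean streamerClosed = false;

  /** Writes one packet; a failed write is attributed to the first DN in the pipeline. */
  void sendPacket(OutputStream blockStream, byte[] packet) {
    try {
      try {
        // write out data to remote datanode
        blockStream.write(packet, 0, packet.length);
        blockStream.flush();
      } catch (IOException e) {
        errorIndex = 0;           // blockStream points at the primary DN, so mark it bad
        throw e;                  // rethrow so the outer handler still records the error
      }
    } catch (Throwable e) {
      hasError = true;
      if (errorIndex == -1) {     // not a datanode error
        streamerClosed = true;
      }
    }
  }

  /** Hypothetical helper: true when the flags describe a recoverable DN failure. */
  boolean shouldRecoverPipeline() {
    return hasError && errorIndex >= 0 && !streamerClosed;
  }

  public static void main(String[] args) {
    // Stub standing in for a socket whose peer (the primary DN) has been killed.
    OutputStream deadPrimary = new OutputStream() {
      @Override public void write(int b) throws IOException {
        throw new IOException("An established connection was aborted");
      }
    };
    PatchedStreamerSketch s = new PatchedStreamerSketch();
    s.sendPacket(deadPrimary, new byte[] {1, 2, 3});
    System.out.println("errorIndex=" + s.errorIndex
        + " streamerClosed=" + s.streamerClosed);
  }
}
```

Because the inner handler sets `errorIndex` before rethrowing, the outer handler no longer sees -1 and leaves the streamer open, which is the "DN failure, not client failure" outcome reported for the patched client.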

        Uma Maheswara Rao G added a comment -

        Thanks a lot, Amith, for the test steps.
        +1 on the latest patch.

        Uma Maheswara Rao G added a comment -

        I have just committed this to trunk and branch-2. Thanks a lot, Amith, for the patch.

        Hudson added a comment -

        Integrated in Hadoop-Hdfs-trunk-Commit #2371 (See https://builds.apache.org/job/Hadoop-Hdfs-trunk-Commit/2371/)
        HDFS-3398. Client will not retry when primaryDN is down once it's just got pipeline. Contributed by Amith D K. (Revision 1343944)

        Result = SUCCESS
        umamahesh : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1343944
        Files :

        • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/DFSOutputStream.java
        Hudson added a comment -

        Integrated in Hadoop-Common-trunk-Commit #2298 (See https://builds.apache.org/job/Hadoop-Common-trunk-Commit/2298/)
        HDFS-3398. Client will not retry when primaryDN is down once it's just got pipeline. Contributed by Amith D K. (Revision 1343944)

        Result = SUCCESS
        umamahesh : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1343944
        Files :

        • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/DFSOutputStream.java
        Hudson added a comment -

        Integrated in Hadoop-Mapreduce-trunk-Commit #2317 (See https://builds.apache.org/job/Hadoop-Mapreduce-trunk-Commit/2317/)
        HDFS-3398. Client will not retry when primaryDN is down once it's just got pipeline. Contributed by Amith D K. (Revision 1343944)

        Result = FAILURE
        umamahesh : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1343944
        Files :

        • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/DFSOutputStream.java
        Hudson added a comment -

        Integrated in Hadoop-Hdfs-trunk #1061 (See https://builds.apache.org/job/Hadoop-Hdfs-trunk/1061/)
        HDFS-3398. Client will not retry when primaryDN is down once it's just got pipeline. Contributed by Amith D K. (Revision 1343944)

        Result = SUCCESS
        umamahesh : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1343944
        Files :

        • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/DFSOutputStream.java
        Hudson added a comment -

        Integrated in Hadoop-Mapreduce-trunk #1095 (See https://builds.apache.org/job/Hadoop-Mapreduce-trunk/1095/)
        HDFS-3398. Client will not retry when primaryDN is down once it's just got pipeline. Contributed by Amith D K. (Revision 1343944)

        Result = SUCCESS
        umamahesh : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1343944
        Files :

        • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/DFSOutputStream.java

          People

          • Assignee: amith
          • Reporter: Brahma Reddy Battula
          • Votes: 0
          • Watchers: 12
