Uploaded image for project: 'Hadoop Map/Reduce'
  1. Hadoop Map/Reduce
  2. MAPREDUCE-6303

Read timeout when retrying a fetch error can be fatal to a reducer

    Details

    • Target Version/s:
    • Hadoop Flags:
      Reviewed

      Description

      If a reducer encounters an error trying to fetch from a node then encounters a read timeout when trying to re-establish the connection then the reducer can fail. The read timeout exception can leak to the top of the Fetcher thread which will cause the reduce task to teardown. This type of error can repeat across reducer attempts causing jobs to fail due to a single bad node.

        Issue Links

          Activity

          Hide
          vinodkv Vinod Kumar Vavilapalli added a comment -

          Pulled this into 2.6.1. Ran compilation and TestFetcher before the push. Patch applied cleanly.

          Show
          vinodkv Vinod Kumar Vavilapalli added a comment - Pulled this into 2.6.1. Ran compilation and TestFetcher before the push. Patch applied cleanly.
          Hide
          l201514 Siqi Li added a comment -

          The patch applies to 2.6.0 cleanly.

          Show
          l201514 Siqi Li added a comment - The patch applies to 2.6.0 cleanly.
          Hide
          hudson Hudson added a comment -

          FAILURE: Integrated in Hadoop-Mapreduce-trunk #2102 (See https://builds.apache.org/job/Hadoop-Mapreduce-trunk/2102/)
          MAPREDUCE-6303. Read timeout when retrying a fetch error can be fatal to a reducer. Contributed by Jason Lowe. (junping_du: rev eccb7d46efbf07abcc6a01bd5e7d682f6815b824)

          • hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapreduce/task/reduce/Fetcher.java
          • hadoop-mapreduce-project/CHANGES.txt
          • hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/test/java/org/apache/hadoop/mapreduce/task/reduce/TestFetcher.java
          Show
          hudson Hudson added a comment - FAILURE: Integrated in Hadoop-Mapreduce-trunk #2102 (See https://builds.apache.org/job/Hadoop-Mapreduce-trunk/2102/ ) MAPREDUCE-6303 . Read timeout when retrying a fetch error can be fatal to a reducer. Contributed by Jason Lowe. (junping_du: rev eccb7d46efbf07abcc6a01bd5e7d682f6815b824) hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapreduce/task/reduce/Fetcher.java hadoop-mapreduce-project/CHANGES.txt hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/test/java/org/apache/hadoop/mapreduce/task/reduce/TestFetcher.java
          Hide
          hudson Hudson added a comment -

          FAILURE: Integrated in Hadoop-Mapreduce-trunk-Java8 #153 (See https://builds.apache.org/job/Hadoop-Mapreduce-trunk-Java8/153/)
          MAPREDUCE-6303. Read timeout when retrying a fetch error can be fatal to a reducer. Contributed by Jason Lowe. (junping_du: rev eccb7d46efbf07abcc6a01bd5e7d682f6815b824)

          • hadoop-mapreduce-project/CHANGES.txt
          • hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapreduce/task/reduce/Fetcher.java
          • hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/test/java/org/apache/hadoop/mapreduce/task/reduce/TestFetcher.java
          Show
          hudson Hudson added a comment - FAILURE: Integrated in Hadoop-Mapreduce-trunk-Java8 #153 (See https://builds.apache.org/job/Hadoop-Mapreduce-trunk-Java8/153/ ) MAPREDUCE-6303 . Read timeout when retrying a fetch error can be fatal to a reducer. Contributed by Jason Lowe. (junping_du: rev eccb7d46efbf07abcc6a01bd5e7d682f6815b824) hadoop-mapreduce-project/CHANGES.txt hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapreduce/task/reduce/Fetcher.java hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/test/java/org/apache/hadoop/mapreduce/task/reduce/TestFetcher.java
          Hide
          hudson Hudson added a comment -

          FAILURE: Integrated in Hadoop-Hdfs-trunk #2084 (See https://builds.apache.org/job/Hadoop-Hdfs-trunk/2084/)
          MAPREDUCE-6303. Read timeout when retrying a fetch error can be fatal to a reducer. Contributed by Jason Lowe. (junping_du: rev eccb7d46efbf07abcc6a01bd5e7d682f6815b824)

          • hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/test/java/org/apache/hadoop/mapreduce/task/reduce/TestFetcher.java
          • hadoop-mapreduce-project/CHANGES.txt
          • hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapreduce/task/reduce/Fetcher.java
          Show
          hudson Hudson added a comment - FAILURE: Integrated in Hadoop-Hdfs-trunk #2084 (See https://builds.apache.org/job/Hadoop-Hdfs-trunk/2084/ ) MAPREDUCE-6303 . Read timeout when retrying a fetch error can be fatal to a reducer. Contributed by Jason Lowe. (junping_du: rev eccb7d46efbf07abcc6a01bd5e7d682f6815b824) hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/test/java/org/apache/hadoop/mapreduce/task/reduce/TestFetcher.java hadoop-mapreduce-project/CHANGES.txt hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapreduce/task/reduce/Fetcher.java
          Hide
          hudson Hudson added a comment -

          FAILURE: Integrated in Hadoop-Hdfs-trunk-Java8 #143 (See https://builds.apache.org/job/Hadoop-Hdfs-trunk-Java8/143/)
          MAPREDUCE-6303. Read timeout when retrying a fetch error can be fatal to a reducer. Contributed by Jason Lowe. (junping_du: rev eccb7d46efbf07abcc6a01bd5e7d682f6815b824)

          • hadoop-mapreduce-project/CHANGES.txt
          • hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/test/java/org/apache/hadoop/mapreduce/task/reduce/TestFetcher.java
          • hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapreduce/task/reduce/Fetcher.java
          Show
          hudson Hudson added a comment - FAILURE: Integrated in Hadoop-Hdfs-trunk-Java8 #143 (See https://builds.apache.org/job/Hadoop-Hdfs-trunk-Java8/143/ ) MAPREDUCE-6303 . Read timeout when retrying a fetch error can be fatal to a reducer. Contributed by Jason Lowe. (junping_du: rev eccb7d46efbf07abcc6a01bd5e7d682f6815b824) hadoop-mapreduce-project/CHANGES.txt hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/test/java/org/apache/hadoop/mapreduce/task/reduce/TestFetcher.java hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapreduce/task/reduce/Fetcher.java
          Hide
          hudson Hudson added a comment -

          FAILURE: Integrated in Hadoop-Yarn-trunk #886 (See https://builds.apache.org/job/Hadoop-Yarn-trunk/886/)
          MAPREDUCE-6303. Read timeout when retrying a fetch error can be fatal to a reducer. Contributed by Jason Lowe. (junping_du: rev eccb7d46efbf07abcc6a01bd5e7d682f6815b824)

          • hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapreduce/task/reduce/Fetcher.java
          • hadoop-mapreduce-project/CHANGES.txt
          • hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/test/java/org/apache/hadoop/mapreduce/task/reduce/TestFetcher.java
          Show
          hudson Hudson added a comment - FAILURE: Integrated in Hadoop-Yarn-trunk #886 (See https://builds.apache.org/job/Hadoop-Yarn-trunk/886/ ) MAPREDUCE-6303 . Read timeout when retrying a fetch error can be fatal to a reducer. Contributed by Jason Lowe. (junping_du: rev eccb7d46efbf07abcc6a01bd5e7d682f6815b824) hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapreduce/task/reduce/Fetcher.java hadoop-mapreduce-project/CHANGES.txt hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/test/java/org/apache/hadoop/mapreduce/task/reduce/TestFetcher.java
          Hide
          hudson Hudson added a comment -

          FAILURE: Integrated in Hadoop-Yarn-trunk-Java8 #152 (See https://builds.apache.org/job/Hadoop-Yarn-trunk-Java8/152/)
          MAPREDUCE-6303. Read timeout when retrying a fetch error can be fatal to a reducer. Contributed by Jason Lowe. (junping_du: rev eccb7d46efbf07abcc6a01bd5e7d682f6815b824)

          • hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapreduce/task/reduce/Fetcher.java
          • hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/test/java/org/apache/hadoop/mapreduce/task/reduce/TestFetcher.java
          • hadoop-mapreduce-project/CHANGES.txt
          Show
          hudson Hudson added a comment - FAILURE: Integrated in Hadoop-Yarn-trunk-Java8 #152 (See https://builds.apache.org/job/Hadoop-Yarn-trunk-Java8/152/ ) MAPREDUCE-6303 . Read timeout when retrying a fetch error can be fatal to a reducer. Contributed by Jason Lowe. (junping_du: rev eccb7d46efbf07abcc6a01bd5e7d682f6815b824) hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapreduce/task/reduce/Fetcher.java hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/test/java/org/apache/hadoop/mapreduce/task/reduce/TestFetcher.java hadoop-mapreduce-project/CHANGES.txt
          Hide
          hudson Hudson added a comment -

          FAILURE: Integrated in Hadoop-trunk-Commit #7496 (See https://builds.apache.org/job/Hadoop-trunk-Commit/7496/)
          MAPREDUCE-6303. Read timeout when retrying a fetch error can be fatal to a reducer. Contributed by Jason Lowe. (junping_du: rev eccb7d46efbf07abcc6a01bd5e7d682f6815b824)

          • hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/test/java/org/apache/hadoop/mapreduce/task/reduce/TestFetcher.java
          • hadoop-mapreduce-project/CHANGES.txt
          • hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapreduce/task/reduce/Fetcher.java
          Show
          hudson Hudson added a comment - FAILURE: Integrated in Hadoop-trunk-Commit #7496 (See https://builds.apache.org/job/Hadoop-trunk-Commit/7496/ ) MAPREDUCE-6303 . Read timeout when retrying a fetch error can be fatal to a reducer. Contributed by Jason Lowe. (junping_du: rev eccb7d46efbf07abcc6a01bd5e7d682f6815b824) hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/test/java/org/apache/hadoop/mapreduce/task/reduce/TestFetcher.java hadoop-mapreduce-project/CHANGES.txt hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapreduce/task/reduce/Fetcher.java
          Hide
          djp Junping Du added a comment -

          I have commit this patch to trunk, branch-2 and branch-2.7. Thanks Jason Lowe for contributing the patch!

          Show
          djp Junping Du added a comment - I have commit this patch to trunk, branch-2 and branch-2.7. Thanks Jason Lowe for contributing the patch!
          Hide
          djp Junping Du added a comment -

          +1. Patch LGTM. Will commit it shortly.

          Show
          djp Junping Du added a comment - +1. Patch LGTM. Will commit it shortly.
          Hide
          djp Junping Du added a comment -

          Junping Du, I would appreciate it if you could take a look. Thanks!

          Sure. I am looking at it now. Thanks!

          Show
          djp Junping Du added a comment - Junping Du, I would appreciate it if you could take a look. Thanks! Sure. I am looking at it now. Thanks!
          Hide
          hadoopqa Hadoop QA added a comment -

          +1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12708981/MAPREDUCE-6303.001.patch
          against trunk revision 867d5d2.

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 1 new or modified test files.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 javadoc. There were no new javadoc warning messages.

          +1 eclipse:eclipse. The patch built with eclipse:eclipse.

          +1 findbugs. The patch does not introduce any new Findbugs (version 2.0.3) warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          +1 core tests. The patch passed unit tests in hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core.

          Test results: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/5366//testReport/
          Console output: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/5366//console

          This message is automatically generated.

          Show
          hadoopqa Hadoop QA added a comment - +1 overall . Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12708981/MAPREDUCE-6303.001.patch against trunk revision 867d5d2. +1 @author . The patch does not contain any @author tags. +1 tests included . The patch appears to include 1 new or modified test files. +1 javac . The applied patch does not increase the total number of javac compiler warnings. +1 javadoc . There were no new javadoc warning messages. +1 eclipse:eclipse . The patch built with eclipse:eclipse. +1 findbugs . The patch does not introduce any new Findbugs (version 2.0.3) warnings. +1 release audit . The applied patch does not increase the total number of release audit warnings. +1 core tests . The patch passed unit tests in hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core. Test results: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/5366//testReport/ Console output: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/5366//console This message is automatically generated.
          Hide
          jlowe Jason Lowe added a comment -

          Patch that treats errors from setupConnectionsWithRetry when trying to re-establish the connection the same way we do errors when trying to establish the initial connection. Added a unit test that fails without the fix and passes afterwards.

          Junping Du, I would appreciate it if you could take a look. Thanks!

          Show
          jlowe Jason Lowe added a comment - Patch that treats errors from setupConnectionsWithRetry when trying to re-establish the connection the same way we do errors when trying to establish the initial connection. Added a unit test that fails without the fix and passes afterwards. Junping Du , I would appreciate it if you could take a look. Thanks!
          Hide
          jlowe Jason Lowe added a comment -

          Sample reduce log snippet showing the issue:

          2015-03-28 00:31:54,393 WARN [main] org.apache.hadoop.mapred.YarnChild: Exception running child : org.apache.hadoop.mapreduce.task.reduce.Shuffle$ShuffleError: error in shuffle in fetcher#7
          	at org.apache.hadoop.mapreduce.task.reduce.Shuffle.run(Shuffle.java:134)
          	at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:376)
          	at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:163)
          	at java.security.AccessController.doPrivileged(Native Method)
          	at javax.security.auth.Subject.doAs(Subject.java:415)
          	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1694)
          	at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:158)
          Caused by: java.net.SocketTimeoutException: Read timed out
          	at java.net.SocketInputStream.socketRead0(Native Method)
          	at java.net.SocketInputStream.read(SocketInputStream.java:150)
          	at java.net.SocketInputStream.read(SocketInputStream.java:121)
          	at java.io.BufferedInputStream.fill(BufferedInputStream.java:235)
          	at java.io.BufferedInputStream.read1(BufferedInputStream.java:275)
          	at java.io.BufferedInputStream.read(BufferedInputStream.java:334)
          	at sun.net.www.http.HttpClient.parseHTTPHeader(HttpClient.java:633)
          	at sun.net.www.http.HttpClient.parseHTTP(HttpClient.java:579)
          	at sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1322)
          	at java.net.HttpURLConnection.getResponseCode(HttpURLConnection.java:468)
          	at org.apache.hadoop.mapreduce.task.reduce.Fetcher.verifyConnection(Fetcher.java:427)
          	at org.apache.hadoop.mapreduce.task.reduce.Fetcher.setupConnectionsWithRetry(Fetcher.java:392)
          	at org.apache.hadoop.mapreduce.task.reduce.Fetcher.copyFromHost(Fetcher.java:338)
          	at org.apache.hadoop.mapreduce.task.reduce.Fetcher.run(Fetcher.java:193)
          
          2015-03-28 00:31:54,511 INFO [main] org.apache.hadoop.mapred.Task: Runnning cleanup for the task
          

          The problem is that the code caught an IOException trying to shuffle and within the catch block the code throws again which leaks up to the top of the Fetcher thread and kills the task.

          Show
          jlowe Jason Lowe added a comment - Sample reduce log snippet showing the issue: 2015-03-28 00:31:54,393 WARN [main] org.apache.hadoop.mapred.YarnChild: Exception running child : org.apache.hadoop.mapreduce.task.reduce.Shuffle$ShuffleError: error in shuffle in fetcher#7 at org.apache.hadoop.mapreduce.task.reduce.Shuffle.run(Shuffle.java:134) at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:376) at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:163) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:415) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1694) at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:158) Caused by: java.net.SocketTimeoutException: Read timed out at java.net.SocketInputStream.socketRead0(Native Method) at java.net.SocketInputStream.read(SocketInputStream.java:150) at java.net.SocketInputStream.read(SocketInputStream.java:121) at java.io.BufferedInputStream.fill(BufferedInputStream.java:235) at java.io.BufferedInputStream.read1(BufferedInputStream.java:275) at java.io.BufferedInputStream.read(BufferedInputStream.java:334) at sun.net.www.http.HttpClient.parseHTTPHeader(HttpClient.java:633) at sun.net.www.http.HttpClient.parseHTTP(HttpClient.java:579) at sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1322) at java.net.HttpURLConnection.getResponseCode(HttpURLConnection.java:468) at org.apache.hadoop.mapreduce.task.reduce.Fetcher.verifyConnection(Fetcher.java:427) at org.apache.hadoop.mapreduce.task.reduce.Fetcher.setupConnectionsWithRetry(Fetcher.java:392) at org.apache.hadoop.mapreduce.task.reduce.Fetcher.copyFromHost(Fetcher.java:338) at org.apache.hadoop.mapreduce.task.reduce.Fetcher.run(Fetcher.java:193) 2015-03-28 00:31:54,511 INFO [main] org.apache.hadoop.mapred.Task: Runnning cleanup for the task The problem is that the code caught an IOException trying to shuffle and within the catch block the code throws again which leaks up to the top of the Fetcher thread and kills the task.

            People

            • Assignee:
              jlowe Jason Lowe
              Reporter:
              jlowe Jason Lowe
            • Votes:
              0 Vote for this issue
              Watchers:
              13 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development