Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 0.18.3, 0.19.2, 0.20.1
    • Fix Version/s: 0.21.0
    • Component/s: fs/s3
    • Labels: None
    • Hadoop Flags: Reviewed

Description

If a user's map function is CPU-intensive and reads from its input only slowly, an effect compounded by input buffering, S3 may conclude that the connection has been lost and close it. When the task next attempts to read from the input, the read fails with a SocketTimeoutException and the task fails.
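
To make the failure mode concrete, here is a minimal Java sketch of a slow consumer that can trigger it. The bucket, object name, buffer size, and sleep interval are illustrative, not taken from the issue, and the sketch assumes s3n credentials are already configured:

    import java.net.URI;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class SlowS3Reader {
      public static void main(String[] args) throws Exception {
        // Assumes the AWS credentials for s3n are set in the configuration;
        // "example-bucket" and the object name are placeholders.
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(URI.create("s3n://example-bucket/"), conf);
        FSDataInputStream in = fs.open(new Path("s3n://example-bucket/big-input.txt"));
        byte[] buf = new byte[4096];
        while (in.read(buf) != -1) {
          // Simulate a CPU-intensive map function that consumes its input
          // slowly. Once this pause outlasts the client-side idle timeout,
          // the next read() fails with a SocketTimeoutException; past the
          // longer, non-configurable server-side timeout it fails with a
          // SocketException("Connection reset by peer").
          Thread.sleep(120000L);
        }
        in.close();
      }
    }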

Attachments

1. HADOOP-6254.diff (5 kB, Andrew Hitchcock)
2. HADOOP-6254.diff (2 kB, Andrew Hitchcock)
3. HADOOP-6254-2.txt (5 kB, Andrew Hitchcock)

Activity

          Ankush Bhatiya added a comment -

           Hi, can we move this fix to the Hadoop 0.20.2 branch as well? Hive currently works with the 0.20.2 branch and we want this fix there too.

          Hudson added a comment -

           Integrated in Hadoop-Common-trunk #182 (See http://hudson.zones.apache.org/hudson/job/Hadoop-Common-trunk/182/)
           HADOOP-6254. Slow reads cause s3n to fail with SocketTimeoutException. Contributed by Andrew Hitchcock.

          Hudson added a comment -

           Integrated in Hadoop-Common-trunk-Commit #101 (See http://hudson.zones.apache.org/hudson/job/Hadoop-Common-trunk-Commit/101/)
           HADOOP-6254. Slow reads cause s3n to fail with SocketTimeoutException. Contributed by Andrew Hitchcock.

          Tom White added a comment -

          I've just committed this. Thanks Andrew!

          Hadoop QA added a comment -

          +1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12427397/HADOOP-6254-2.txt
          against trunk revision 888565.

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 3 new or modified tests.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 findbugs. The patch does not introduce any new Findbugs warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          +1 core tests. The patch passed core unit tests.

          +1 contrib tests. The patch passed contrib unit tests.

          Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-h1.grid.sp2.yahoo.net/23/testReport/
          Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-h1.grid.sp2.yahoo.net/23/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
          Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-h1.grid.sp2.yahoo.net/23/artifact/trunk/build/test/checkstyle-errors.html
          Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-h1.grid.sp2.yahoo.net/23/console

          This message is automatically generated.

          Andrew Hitchcock added a comment -

           New version of the patch. The only difference from the last patch is the addition of two comment lines to address Tom's concerns.

          Andrew Hitchcock added a comment -

           We could catch both of these exceptions individually, but it is conceivable that other non-fatal IOExceptions could occur when reading from S3. I'd prefer to catch all IOExceptions (which encompass both SocketTimeoutException and SocketException).
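
           For reference, both of those exception types are IOExceptions in the Java standard library, so a single catch clause covers them. A small self-contained check, illustrative rather than part of the patch:

               import java.io.IOException;
               import java.net.SocketException;
               import java.net.SocketTimeoutException;

               public class ExceptionHierarchyCheck {
                 public static void main(String[] args) {
                   // SocketTimeoutException extends InterruptedIOException, which
                   // extends IOException; SocketException extends IOException
                   // directly. Both lines therefore print "true".
                   System.out.println(IOException.class.isAssignableFrom(SocketTimeoutException.class));
                   System.out.println(IOException.class.isAssignableFrom(SocketException.class));
                 }
               }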

          Chris Douglas added a comment -

           Cancelling the patch while Tom's comments are addressed.

          Tom White added a comment -

           Thanks for the detailed explanation, Andrew. This looks good to me. I had a couple of small comments:

          • Do we need to catch all IOExceptions, or would SocketTimeoutException and SocketException meet the needs of this issue?
           • Minor nit: it would be nice to explain in a comment where 144 comes from in the test (128 + 16); a sketch of such a comment follows.
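
           The kind of explanatory comment suggested in the second point might look like the sketch below. The constant names, and the assumption that the two operands are a buffer size plus a read past the buffer boundary, are illustrative rather than taken from the actual test:

               // Assumed meaning for illustration: the stream buffers 128 bytes at a
               // time, and the test reads 16 bytes beyond the first buffer, so it
               // expects 128 + 16 = 144 bytes in total.
               private static final int BUFFER_SIZE = 128;      // assumption
               private static final int BYTES_PAST_BUFFER = 16; // assumption
               private static final int EXPECTED_BYTES = BUFFER_SIZE + BYTES_PAST_BUFFER; // 144
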
          Hadoop QA added a comment -

          +1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12420090/HADOOP-6254.diff
          against trunk revision 816830.

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 3 new or modified tests.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 findbugs. The patch does not introduce any new Findbugs warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          +1 core tests. The patch passed core unit tests.

          +1 contrib tests. The patch passed contrib unit tests.

          Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-h4.grid.sp2.yahoo.net/54/testReport/
          Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-h4.grid.sp2.yahoo.net/54/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
          Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-h4.grid.sp2.yahoo.net/54/artifact/trunk/build/test/checkstyle-errors.html
          Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-h4.grid.sp2.yahoo.net/54/console

          This message is automatically generated.

          Andrew Hitchcock added a comment -

          This patch incorporates the changes as per our discussion and my last comment. I've included a unit test as well. The test necessitated a little refactoring of NativeS3FileSystem.

          Andrew Hitchcock added a comment -

          Hi Tom,

           The problem with changing the socket read timeout is that Hadoop tasks can process data at an arbitrary rate, which means that mapper input from Amazon S3 may be read arbitrarily slowly. There are two timeouts you can hit with Amazon S3 if you leave a socket open long enough without pulling any data from it:

           • You can hit a client-side timeout, which is configurable and appears as a SocketTimeoutException.
           • You can hit an Amazon S3 server-side timeout, which is not configurable and appears as a SocketException("Connection reset by peer").

           Just increasing the client-side timeout has four problems:

           1. Increasing the timeout keeps the connection open longer, whereas what we're trying to do is give up the connection after a reasonable timeout and then reopen it when we need it again. This way we play nicer with various system resources.
           2. No matter what value we choose, one can imagine a task pulling data even more slowly and so still encountering this exception.
           3. There is some value of the client-side timeout above which all that happens is that we get a server-side timeout instead.
           4. As a general rule, you don't want client socket timeouts to be too big, because it is always possible for a server to get "stuck" and stop sending data, in which case you want to recognize the failure in a timely manner via the timeout. (Not that Amazon S3 is known to have any such issues, but it's best to be defensive in error handling.)

          Thus I now think the best solution is:

          • Catch all IOExceptions and then retry once
           • Keep the socket timeout at 60 seconds, as it seems a reasonable trade-off between the cost of holding a connection open and the cost of re-establishing it.

          I'll prepare a new patch.
          Thanks.
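
           A minimal sketch of that retry-once idea follows. The class and method names are illustrative, and the actual patch refactors NativeS3FileSystem rather than taking exactly this shape:

               import java.io.IOException;
               import java.io.InputStream;

               public abstract class RetryOnceInputStream extends InputStream {
                 private InputStream in;
                 private long pos; // bytes consumed so far

                 protected RetryOnceInputStream(InputStream in) {
                   this.in = in;
                 }

                 /** Reopen the underlying S3 object starting at the given offset. */
                 protected abstract InputStream reopen(long offset) throws IOException;

                 @Override
                 public int read() throws IOException {
                   int b;
                   try {
                     b = in.read();
                   } catch (IOException e) {
                     // The connection likely timed out: on the client side as a
                     // SocketTimeoutException, on the server side as a
                     // SocketException("Connection reset by peer"). Reopen at the
                     // current position and retry exactly once; a failure on the
                     // retried read propagates, so there is no recursion.
                     try {
                       in.close();
                     } catch (IOException ignored) {
                       // The old connection is already broken; nothing to do.
                     }
                     in = reopen(pos);
                     b = in.read();
                   }
                   if (b != -1) {
                     pos++;
                   }
                   return b;
                 }

                 @Override
                 public void close() throws IOException {
                   in.close();
                 }
               }

           Wrapping the store's stream this way bounds the cost of a stale connection at one reopen, while the 60-second client-side timeout still limits how long an idle socket is held.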

          Tom White added a comment -

          You're right - your change doesn't introduce an infinite timeout. But it's not a particularly general-purpose solution since it effectively doubles the timeout (default 1 minute). It would be simpler to change the default value if lots of folks are hitting this.

          Also, any reason you didn't catch SocketTimeoutException directly, rather than catching IOException and re-throwing anything that's not SocketTimeoutException?

          Andrew Hitchcock added a comment -

           I don't believe the patch sets an infinite timeout. Upon reopening the connection it tries once more, and if that read fails the call fails; it doesn't recurse.

           I think this is better than fiddling with the timeout because users then don't have to worry about adjusting another setting; it should just work out of the box. In my testing I was able to reproduce the SocketTimeoutException, and this patch reliably fixed it.

          Tom White added a comment -

           Is it possible to set the read timeout on the socket? Setting httpclient.socket-timeout-ms looks like it would do it: http://jets3t.s3.amazonaws.com/toolkit/configuration.html. This would mean you could set the timeout to be large but not infinite (although you can make it infinite if you like by setting it to -1), whereas the approach in your patch effectively sets an infinite timeout.
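
           For reference, jets3t reads this setting from a jets3t.properties file on the classpath. A sketch, with an illustrative value rather than a recommendation:

               # jets3t.properties (on the classpath)
               # Client-side socket read timeout in milliseconds. Per the jets3t
               # configuration page linked above, -1 makes it infinite.
               httpclient.socket-timeout-ms=120000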

          Andrew Hitchcock added a comment -

          Here is my proposed fix. This patch doesn't include a unit test yet.


People

• Assignee: Andrew Hitchcock
• Reporter: Andrew Hitchcock
• Votes: 0
• Watchers: 6
