Details

    • Type: Bug Bug
    • Status: Closed
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: 0.18.3, 0.19.2, 0.20.1
    • Fix Version/s: 0.21.0
    • Component/s: fs/s3
    • Labels:
      None
    • Hadoop Flags:
      Reviewed

      Description

      If a user's map function is CPU intensive and doesn't read from the input very quickly, compounded by the buffering of input, then S3 might think the connection has been lost and will close the connection. Then when the user attempts to read from the input again, they'll receive a SocketTimeoutException and the task will fail.

      1. HADOOP-6254-2.txt
        5 kB
        Andrew Hitchcock
      2. HADOOP-6254.diff
        2 kB
        Andrew Hitchcock
      3. HADOOP-6254.diff
        5 kB
        Andrew Hitchcock

        Activity

        Hide
        Ankush Bhatiya added a comment -

        Hi, Can we move this fix to hadoop 0.20.2 branch as well, Hive currently works with 0.20.2 branch and we want this fix there too.

        Show
        Ankush Bhatiya added a comment - Hi, Can we move this fix to hadoop 0.20.2 branch as well, Hive currently works with 0.20.2 branch and we want this fix there too.
        Hide
        Hudson added a comment -

        Integrated in Hadoop-Common-trunk #182 (See http://hudson.zones.apache.org/hudson/job/Hadoop-Common-trunk/182/)
        . Slow reads cause s3n to fail with SocketTimeoutException. Contributed by Andrew Hitchcock.

        Show
        Hudson added a comment - Integrated in Hadoop-Common-trunk #182 (See http://hudson.zones.apache.org/hudson/job/Hadoop-Common-trunk/182/ ) . Slow reads cause s3n to fail with SocketTimeoutException. Contributed by Andrew Hitchcock.
        Hide
        Hudson added a comment -

        Integrated in Hadoop-Common-trunk-Commit #101 (See http://hudson.zones.apache.org/hudson/job/Hadoop-Common-trunk-Commit/101/)
        . Slow reads cause s3n to fail with SocketTimeoutException. Contributed by Andrew Hitchcock.

        Show
        Hudson added a comment - Integrated in Hadoop-Common-trunk-Commit #101 (See http://hudson.zones.apache.org/hudson/job/Hadoop-Common-trunk-Commit/101/ ) . Slow reads cause s3n to fail with SocketTimeoutException. Contributed by Andrew Hitchcock.
        Hide
        Tom White added a comment -

        I've just committed this. Thanks Andrew!

        Show
        Tom White added a comment - I've just committed this. Thanks Andrew!
        Hide
        Hadoop QA added a comment -

        +1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12427397/HADOOP-6254-2.txt
        against trunk revision 888565.

        +1 @author. The patch does not contain any @author tags.

        +1 tests included. The patch appears to include 3 new or modified tests.

        +1 javadoc. The javadoc tool did not generate any warning messages.

        +1 javac. The applied patch does not increase the total number of javac compiler warnings.

        +1 findbugs. The patch does not introduce any new Findbugs warnings.

        +1 release audit. The applied patch does not increase the total number of release audit warnings.

        +1 core tests. The patch passed core unit tests.

        +1 contrib tests. The patch passed contrib unit tests.

        Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-h1.grid.sp2.yahoo.net/23/testReport/
        Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-h1.grid.sp2.yahoo.net/23/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
        Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-h1.grid.sp2.yahoo.net/23/artifact/trunk/build/test/checkstyle-errors.html
        Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-h1.grid.sp2.yahoo.net/23/console

        This message is automatically generated.

        Show
        Hadoop QA added a comment - +1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12427397/HADOOP-6254-2.txt against trunk revision 888565. +1 @author. The patch does not contain any @author tags. +1 tests included. The patch appears to include 3 new or modified tests. +1 javadoc. The javadoc tool did not generate any warning messages. +1 javac. The applied patch does not increase the total number of javac compiler warnings. +1 findbugs. The patch does not introduce any new Findbugs warnings. +1 release audit. The applied patch does not increase the total number of release audit warnings. +1 core tests. The patch passed core unit tests. +1 contrib tests. The patch passed contrib unit tests. Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-h1.grid.sp2.yahoo.net/23/testReport/ Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-h1.grid.sp2.yahoo.net/23/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-h1.grid.sp2.yahoo.net/23/artifact/trunk/build/test/checkstyle-errors.html Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-h1.grid.sp2.yahoo.net/23/console This message is automatically generated.
        Hide
        Andrew Hitchcock added a comment -

        New version of the patch. The only difference between the last patch is the addition of two lines of comments to address Tom's concerns.

        Show
        Andrew Hitchcock added a comment - New version of the patch. The only difference between the last patch is the addition of two lines of comments to address Tom's concerns.
        Hide
        Andrew Hitchcock added a comment -

        We could catch both of these Exceptions individually, but it is conceivable that other non-fatal IOExceptions could occur when reading from S3. I'd prefer having it catch all IOExceptions (which encompass SocketTimeoutException and SocketException).

        Show
        Andrew Hitchcock added a comment - We could catch both of these Exceptions individually, but it is conceivable that other non-fatal IOExceptions could occur when reading from S3. I'd prefer having it catch all IOExceptions (which encompass SocketTimeoutException and SocketException).
        Hide
        Chris Douglas added a comment -

        Cancelling while Tom's comments are addressed.

        Show
        Chris Douglas added a comment - Cancelling while Tom's comments are addressed.
        Hide
        Tom White added a comment -

        Thanks for the detailed explanation, Andrew. This looks good to me. I had couple of small comments:

        • Do we need to catch all IOExceptions, or would SocketTimeoutException and SocketException meet the needs of this issue?
        • Minor nit: would be nice to explain in a comment where 144 comes from in the test (128 + 16).
        Show
        Tom White added a comment - Thanks for the detailed explanation, Andrew. This looks good to me. I had couple of small comments: Do we need to catch all IOExceptions, or would SocketTimeoutException and SocketException meet the needs of this issue? Minor nit: would be nice to explain in a comment where 144 comes from in the test (128 + 16).
        Hide
        Hadoop QA added a comment -

        +1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12420090/HADOOP-6254.diff
        against trunk revision 816830.

        +1 @author. The patch does not contain any @author tags.

        +1 tests included. The patch appears to include 3 new or modified tests.

        +1 javadoc. The javadoc tool did not generate any warning messages.

        +1 javac. The applied patch does not increase the total number of javac compiler warnings.

        +1 findbugs. The patch does not introduce any new Findbugs warnings.

        +1 release audit. The applied patch does not increase the total number of release audit warnings.

        +1 core tests. The patch passed core unit tests.

        +1 contrib tests. The patch passed contrib unit tests.

        Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-h4.grid.sp2.yahoo.net/54/testReport/
        Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-h4.grid.sp2.yahoo.net/54/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
        Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-h4.grid.sp2.yahoo.net/54/artifact/trunk/build/test/checkstyle-errors.html
        Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-h4.grid.sp2.yahoo.net/54/console

        This message is automatically generated.

        Show
        Hadoop QA added a comment - +1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12420090/HADOOP-6254.diff against trunk revision 816830. +1 @author. The patch does not contain any @author tags. +1 tests included. The patch appears to include 3 new or modified tests. +1 javadoc. The javadoc tool did not generate any warning messages. +1 javac. The applied patch does not increase the total number of javac compiler warnings. +1 findbugs. The patch does not introduce any new Findbugs warnings. +1 release audit. The applied patch does not increase the total number of release audit warnings. +1 core tests. The patch passed core unit tests. +1 contrib tests. The patch passed contrib unit tests. Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-h4.grid.sp2.yahoo.net/54/testReport/ Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-h4.grid.sp2.yahoo.net/54/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-h4.grid.sp2.yahoo.net/54/artifact/trunk/build/test/checkstyle-errors.html Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-h4.grid.sp2.yahoo.net/54/console This message is automatically generated.
        Hide
        Andrew Hitchcock added a comment -

        This patch incorporates the changes as per our discussion and my last comment. I've included a unit test as well. The test necessitated a little refactoring of NativeS3FileSystem.

        Show
        Andrew Hitchcock added a comment - This patch incorporates the changes as per our discussion and my last comment. I've included a unit test as well. The test necessitated a little refactoring of NativeS3FileSystem.
        Hide
        Andrew Hitchcock added a comment -

        Hi Tom,

        The problem with changing the socket read timeout is Hadoop tasks can process at an arbitrarily rate, which means that mapper input data from Amazon S3 can be read at an arbitrary rate. There are two timeouts you can hit with Amazon S3 if you leave a socket open for long enough without pulling any data from it:

        • You can hit a client side timeout, which is configurable, and appears as a SocketTimeoutException
        • You can hit an Amazon S3 server side timeout, which is not configurable, and appears as a SocketException("Connection reset by peer").

        Just increasing the client side timeout has 4 problems:

        1. Increasing timeouts will keep the connection open longer, whereas what we're trying to do is give up the connection after a reasonable timeout, but then reopen it when we need it again. This way we're playing nicer with various system resources.
        2. No matter what we put it at, one can imagine a task pulling data slower, and so encountering this exception
        3. There is some value of the client side timeout above which all that happens is that we get a server side timeout instead
        4. As a generalization you don't want client socket timeouts to be too big because it is always possible for a server to get "stuck" and stop sending data, in which case you want to recognize this failure in a timely manner via the timeout. (Not that Amazon S3 is known to have any such issues, but its best to be defensive in error handling).

        Thus I now think the best solution is:

        • Catch all IOExceptions and then retry once
        • Keep the socket timeout at 60 seconds as it seems a reasonable trade-off between the cost of holding a connection open and the cost of reestablishing the connection.

        I'll prepare a new patch.
        Thanks.

        Show
        Andrew Hitchcock added a comment - Hi Tom, The problem with changing the socket read timeout is Hadoop tasks can process at an arbitrarily rate, which means that mapper input data from Amazon S3 can be read at an arbitrary rate. There are two timeouts you can hit with Amazon S3 if you leave a socket open for long enough without pulling any data from it: You can hit a client side timeout, which is configurable, and appears as a SocketTimeoutException You can hit an Amazon S3 server side timeout, which is not configurable, and appears as a SocketException("Connection reset by peer"). Just increasing the client side timeout has 4 problems: 1. Increasing timeouts will keep the connection open longer, whereas what we're trying to do is give up the connection after a reasonable timeout, but then reopen it when we need it again. This way we're playing nicer with various system resources. 2. No matter what we put it at, one can imagine a task pulling data slower, and so encountering this exception 3. There is some value of the client side timeout above which all that happens is that we get a server side timeout instead 4. As a generalization you don't want client socket timeouts to be too big because it is always possible for a server to get "stuck" and stop sending data, in which case you want to recognize this failure in a timely manner via the timeout. (Not that Amazon S3 is known to have any such issues, but its best to be defensive in error handling). Thus I now think the best solution is: Catch all IOExceptions and then retry once Keep the socket timeout at 60 seconds as it seems a reasonable trade-off between the cost of holding a connection open and the cost of reestablishing the connection. I'll prepare a new patch. Thanks.
        Hide
        Tom White added a comment -

        You're right - your change doesn't introduce an infinite timeout. But it's not a particularly general-purpose solution since it effectively doubles the timeout (default 1 minute). It would be simpler to change the default value if lots of folks are hitting this.

        Also, any reason you didn't catch SocketTimeoutException directly, rather than catching IOException and re-throwing anything that's not SocketTimeoutException?

        Show
        Tom White added a comment - You're right - your change doesn't introduce an infinite timeout. But it's not a particularly general-purpose solution since it effectively doubles the timeout (default 1 minute). It would be simpler to change the default value if lots of folks are hitting this. Also, any reason you didn't catch SocketTimeoutException directly, rather than catching IOException and re-throwing anything that's not SocketTimeoutException?
        Hide
        Andrew Hitchcock added a comment -

        I don't believe the patch sets an infinite timeout. Upon reopening the connection, it tries once and if that fails then the call fails. It doesn't recurse.

        I think this is better than fiddling with the timeout because now users don't have to worry about adjusting another setting, this should just work out of the box. In my testing, I was able to reproduce the SocketTimeoutException and this patch reliably fixed it.

        Show
        Andrew Hitchcock added a comment - I don't believe the patch sets an infinite timeout. Upon reopening the connection, it tries once and if that fails then the call fails. It doesn't recurse. I think this is better than fiddling with the timeout because now users don't have to worry about adjusting another setting, this should just work out of the box. In my testing, I was able to reproduce the SocketTimeoutException and this patch reliably fixed it.
        Hide
        Tom White added a comment -

        Is it possible to set the read timeout on the socket? Setting httpclient.socket-timeout-ms looks like it would do it - http://jets3t.s3.amazonaws.com/toolkit/configuration.html. This would mean you can set the timeout to be large but not infinite (although you can make it infinite if you like by setting it to the value of -1), whereas the approach in your patch effectively sets an infinite timeout.

        Show
        Tom White added a comment - Is it possible to set the read timeout on the socket? Setting httpclient.socket-timeout-ms looks like it would do it - http://jets3t.s3.amazonaws.com/toolkit/configuration.html . This would mean you can set the timeout to be large but not infinite (although you can make it infinite if you like by setting it to the value of -1), whereas the approach in your patch effectively sets an infinite timeout.
        Hide
        Andrew Hitchcock added a comment -

        Here is my proposed fix. This patch doesn't include a unit test yet.

        Show
        Andrew Hitchcock added a comment - Here is my proposed fix. This patch doesn't include a unit test yet.

          People

          • Assignee:
            Andrew Hitchcock
            Reporter:
            Andrew Hitchcock
          • Votes:
            0 Vote for this issue
            Watchers:
            6 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development