The problem with changing the socket read timeout is that Hadoop tasks can process data at an arbitrarily slow rate, which means that mapper input data from Amazon S3 may also be read at an arbitrarily slow rate. There are two timeouts you can hit with Amazon S3 if you leave a socket open long enough without pulling any data from it:
- You can hit a client-side timeout, which is configurable and appears as a SocketTimeoutException.
- You can hit an Amazon S3 server-side timeout, which is not configurable and appears as a SocketException("Connection reset by peer").
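The client-side case is easy to reproduce without S3 at all. The sketch below is illustrative only: a silent local ServerSocket stands in for a stalled server, and the short timeout values are chosen just for the demo, not proposed settings.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.ServerSocket;
import java.net.Socket;
import java.net.SocketTimeoutException;

public class TimeoutDemo {
    // Connects to a local stand-in server that accepts the connection but
    // never sends any data, so the configurable client-side read timeout fires.
    static String classify() throws IOException {
        try (ServerSocket server = new ServerSocket(0); // silent local "server"
             Socket client = new Socket()) {
            client.connect(new InetSocketAddress("127.0.0.1", server.getLocalPort()), 1_000);
            client.setSoTimeout(200); // the configurable client-side read timeout (ms)
            try {
                client.getInputStream().read(); // blocks: no data ever arrives
                return "read returned";
            } catch (SocketTimeoutException e) {
                // Client-side timeout: the case we control.
                // A server-side drop would instead surface as
                // SocketException("Connection reset by peer"), also an IOException.
                return "client-side timeout";
            }
        }
    }

    public static void main(String[] args) throws IOException {
        System.out.println(classify());
    }
}
```

Note that SocketTimeoutException is a subclass of IOException, which matters for the retry strategy discussed below: catching IOException covers both failure modes.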
Just increasing the client-side timeout has four problems:
1. Increasing timeouts keeps the connection open longer, whereas what we're trying to do is give up the connection after a reasonable timeout and then reopen it when we need it again. That way we play more nicely with various system resources.
2. No matter what value we choose, one can always imagine a task pulling data even more slowly, and so still encountering this exception.
3. There is some value of the client-side timeout above which all that happens is that we get a server-side timeout instead.
4. As a generalization, you don't want client socket timeouts to be too big, because it is always possible for a server to get "stuck" and stop sending data, in which case you want to recognize the failure in a timely manner via the timeout. (Not that Amazon S3 is known to have any such issues, but it's best to be defensive in error handling.)
Thus I now think the best solution is:
- Catch all IOExceptions and then retry the read once.
- Keep the socket timeout at 60 seconds, as that seems a reasonable trade-off between the cost of holding a connection open and the cost of re-establishing it.
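The retry-once idea could be sketched as below. The `withOneRetry` helper and `IOAction` interface are illustrative names for this comment, not the actual patch; the real code would reopen the S3 connection before the second attempt.

```java
import java.io.IOException;

public class RetryOnce {
    // An I/O action that may throw IOException (e.g. a read from S3).
    interface IOAction<T> {
        T run() throws IOException;
    }

    // Run the action; on any IOException (client-side SocketTimeoutException
    // or a server-side "Connection reset by peer"), retry exactly once,
    // then let a second failure propagate to the caller.
    static <T> T withOneRetry(IOAction<T> action) throws IOException {
        try {
            return action.run();
        } catch (IOException first) {
            // In the real patch this is where the connection would be reopened.
            return action.run();
        }
    }

    public static void main(String[] args) throws IOException {
        // Simulate a read that fails once and succeeds on the retry.
        final int[] attempts = {0};
        String result = withOneRetry(() -> {
            if (attempts[0]++ == 0) {
                throw new IOException("Connection reset by peer");
            }
            return "data";
        });
        System.out.println(result + " after " + attempts[0] + " attempts");
    }
}
```

A single retry is enough here because the failure mode being handled is an idle-connection drop, not a persistent outage; if the retry also fails, something else is wrong and the exception should surface.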
I'll prepare a new patch.