Hadoop Common / HADOOP-882

S3FileSystem should retry if there is a communication problem with S3

    Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 0.10.1
    • Fix Version/s: 0.12.0
    • Component/s: fs
    • Labels: None

      Description

      File system operations currently fail if there is a communication problem (IOException) with S3. All operations that communicate with S3 should retry a fixed number of times before failing.
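
      As a rough, hedged illustration of the requested behavior (the helper name and the use of java.util.concurrent.Callable are hypothetical, not part of the eventual patch):

          // Sketch only: retry an S3 call a fixed number of times on IOException.
          // Requires java.io.IOException and java.util.concurrent.Callable.
          static <T> T withRetries(Callable<T> s3Call, int maxAttempts) throws Exception {
            IOException last = null;
            for (int attempt = 1; attempt <= maxAttempts; attempt++) {
              try {
                return s3Call.call();   // the S3 operation being protected
              } catch (IOException e) {
                last = e;               // communication problem: try again
              }
            }
            throw last;                 // still failing after the fixed number of attempts
          }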

      Attachments

      1. jets3t-0.5.0.jar (249 kB, attached by stack)
      2. jets3t-upgrade.patch (3 kB, attached by stack)


          Activity

          Doug Cutting added a comment -

          I just committed this. Thanks, Michael!

          stack added a comment -

          Nigel, I'm guessing the patch application failed because it doesn't incorporate removal of the jar ${HADOOP_HOME}/lib/jets3t.jar and replacement with the attached jets3t-0.5.0.jar. Without the latter lib, the patch won't work. Thanks.

          Hadoop QA added a comment -

          -1, because 3 attempts failed to build and test the latest attachment (http://issues.apache.org/jira/secure/attachment/12350614/jets3t-upgrade.patch) against trunk revision r504682. Please note that this message is automatically generated and may represent a problem with the automation system and not the patch.

          Tom White added a comment -

          +1

          I have tested the patch successfully using Jets3tS3FileSystemTest and think it is ready for inclusion. This patch will certainly improve the reliability of S3 "metadata" operations, since they fit into the jets3t buffer limit. Block read and write failures aren't retried effectively, so for these cases I have created HADOOP-997 as a follow-on issue.

          stack added a comment -

          Here's a patch that makes the minor changes necessary so the s3 implementation can use the new 0.5.0 jets3t 'retrying' lib. It also exposes fs.s3.block.size in hadoop-default.xml with a note about how to set the jets3t RepeatableInputStream buffer size by adding a jets3t.properties to ${HADOOP_HOME}/conf. Setting this latter buffer to the same as the s3 block size avoids failures of the kind 'Input stream is not repeatable as 1048576 bytes have been written, exceeding the available buffer size of 131072'.

          The downside to this patch's approach is that if you want to match block and buffer size, you need to set the same value in two places: once in hadoop-site and again in jets3t.properties (illustrated below). This seemed to me better than the alternative, a tighter coupling that bubbles the main jets3t properties up into the hadoop-*.xml filesystem section as fs.s3.jets3t.XXX properties, with the init of the s3 filesystem setting the values into org.jets3t.service.Jets3tProperties.
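
          For illustration only, here is what the two-places configuration could look like, using the property names discussed in this thread (fs.s3.block.size and jets3t's s3service.stream-retry-buffer-size); the 64MB value is an example, not a recommendation:

            <!-- hadoop-site.xml -->
            <property>
              <name>fs.s3.block.size</name>
              <value>67108864</value> <!-- 64MB; keep in sync with jets3t.properties -->
            </property>

            # ${HADOOP_HOME}/conf/jets3t.properties
            s3service.stream-retry-buffer-size=67108864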

          I didn't change the default S3 block size from 1MB. Setting it to 64MB seems too far afield from the default jets3t RepeatableInputStream size of only 100k.

          I've included the 0.5.0 jets3t lib as part of the upload (there doesn't seem to be a way to include binaries using svn diff). Its license is Apache 2.0.

          Tom White, thanks for pointing me at the unit test. Also, I'd go along with closing this issue with the update of the jets3t lib and opening another issue to track the S3 filesystem implementing a general, 'traffic-level' Hadoop retry mechanism.

          Tom White added a comment -

          The best way to check the new jets3t library is to run the Jets3tS3FileSystemTest unit test. You will need to set your S3 credentials in the hadoop-site.xml file in the test directory. If this passes you can be confident that the upgrade has worked.
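
          For reference, a hedged example of the sort of entries meant here, assuming the fs.s3.awsAccessKeyId / fs.s3.awsSecretAccessKey property names (the values are placeholders):

            <!-- hadoop-site.xml in the test directory -->
            <property>
              <name>fs.s3.awsAccessKeyId</name>
              <value>YOUR_AWS_ACCESS_KEY_ID</value>
            </property>
            <property>
              <name>fs.s3.awsSecretAccessKey</name>
              <value>YOUR_AWS_SECRET_ACCESS_KEY</value>
            </property>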

          Changing the buffer size to be as big as the block size sounds good (however, I worry a little whether there could be a memory issue if jets3t buffers in memory, as seems likely).

          The 1MB block size was a fairly arbitrary value I selected during testing. I agree that 64MB would be better. The fs.s3.block.size property needs adding to hadoop-default.xml, and the DEFAULT_BLOCK_SIZE constant in S3FileSystem needs changing too.

          A patch for all this stuff would be very welcome!

          As for whether completing all these items would mean the issue is closed, I'm not sure. The jets3t retry mechanism is for S3-level exceptions; if there is a traffic-level communication problem, then there is nothing to handle it. For this, we could use the more general Hadoop-level mechanism described above (or possibly the retry mechanism in HttpClient, if that's sufficient). I think this work would belong in another Jira issue. Thoughts?

          stack added a comment -

          I updated jets3t to the 0.5.0 release. I had to make the edits below. The API has probably changed in other ways, but I've not spent the time verifying. Unless someone else is working on a patch that includes the new version of the jets3t lib and complementary changes to the s3 fs, I can give it a go (retries, it seems, are necessary if you're trying to upload anything more than a few kilobytes).

          Relatedly, after adding the new lib and making the changes below, uploads ('puts') would fail with the following complaint.

          07/01/31 00:47:13 WARN service.S3Service: Encountered 1 S3 Internal Server error(s), will retry in 50ms
          put: Input stream is not repeatable as 1048576 bytes have been written, exceeding the available buffer size of 131072

          I found the 131072 buffer in jets3t. Turns out the buffer size is configurable. Dropping a jets3t.properties file into the ${HADOOP_HOME}/conf directory (so a lookup on CLASSPATH succeeds) with an amended s3service.stream-retry-buffer-size got me over the 'put: Input stream...' hump. I set it to the value of dfs.block.size so it could replay a full block if it had to.

          Then I noticed that the blocks written to S3 were 1MB in size. I'm uploading tens of GBs, so that made for tens of thousands of blocks. No harm I suppose, but I was a little stumped that the block size in S3 wasn't the value of dfs.block.size. I found the fs.s3.block.size property in the S3 fs code. Shouldn't this setting be bubbled up into hadoop-default with a default value of ${dfs.block.size}? (Setting this in my config made for 64MB S3 blocks.)

          I can add the latter items to the wiki on S3, or I can include a jets3t.properties and s3.block.size in the patch. What do others think?

          Index: src/java/org/apache/hadoop/fs/s3/Jets3tFileSystemStore.java
          ===================================================================
          --- src/java/org/apache/hadoop/fs/s3/Jets3tFileSystemStore.java (revision 501895)
          +++ src/java/org/apache/hadoop/fs/s3/Jets3tFileSystemStore.java (working copy)
          @@ -133,7 +133,7 @@
                 S3Object object = s3Service.getObject(bucket, key);
                 return object.getDataInputStream();
               } catch (S3ServiceException e) {
          -      if (e.getErrorCode().equals("NoSuchKey")) {
          +      if (e.getS3ErrorCode().equals("NoSuchKey")) {
                   return null;
                 }
                 if (e.getCause() instanceof IOException) {
          @@ -149,7 +149,7 @@
                                                null, byteRangeStart, null);
                 return object.getDataInputStream();
               } catch (S3ServiceException e) {
          -      if (e.getErrorCode().equals("NoSuchKey")) {
          +      if (e.getS3ErrorCode().equals("NoSuchKey")) {
                   return null;
                 }
                 if (e.getCause() instanceof IOException) {

          Tom White added a comment -

          Looks like jets3t version 0.5.0 was released yesterday: http://jets3t.s3.amazonaws.com/downloads.html.
          Configuration notes are here: http://jets3t.s3.amazonaws.com/toolkit/configuration.html - s3service.internal-error-retry-max is the relevant one.
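
          For illustration, a one-line jets3t.properties entry using that property (the value of 5 is an arbitrary example, not a recommendation):

            # ${HADOOP_HOME}/conf/jets3t.properties
            s3service.internal-error-retry-max=5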

          Tom White added a comment -

          I was actually just writing the following response: If we throw a different type of exception (TransientS3Exception) for recoverable error codes (there's a complete list of error codes here BTW: http://docs.amazonwebservices.com/AmazonS3/2006-03-01/ErrorCodeList.html) then we can make the proxy only retry for that exception. This is in line with HADOOP-601, but until that's done we could build an S3RetryProxy just for S3.

          We still might want to follow this more general approach (if it gives us more control), but for the time being if jets3t's retry mechanism is adequate then let's try that.
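
          To make the TransientS3Exception idea concrete, a minimal hedged sketch (the class and helper below are illustrative only, not existing Hadoop code); it singles out the recoverable error codes mentioned in this thread so a retry proxy could retry just that exception type:

            // Requires java.io.IOException and org.jets3t.service.S3ServiceException.
            class TransientS3Exception extends IOException {
              TransientS3Exception(Throwable cause) {
                super(cause == null ? null : cause.toString());
                initCause(cause);
              }
            }

            // Would live in the store class: wrap only the recoverable codes so a
            // retry proxy can catch TransientS3Exception and retry just those.
            static void rethrowIfTransient(S3ServiceException e) throws TransientS3Exception {
              String code = e.getS3ErrorCode();
              if ("InternalError".equals(code)
                  || "RequestTimeout".equals(code)
                  || "OperationAborted".equals(code)) {
                throw new TransientS3Exception(e);
              }
            }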

          Mike Smith added a comment -

          Instead of the released version of jets3t, which is 5 months old, I built it from their CVS source code. In this version, unlike the released version, you get a retry mechanism for S3 exceptions. The CVS version seems to be much more stable than the released version. Here are the two most important new features from their release notes:

          • Requests that fail due to S3 Internal Server error are retried a configurable number of times, with an increasing delay between each retry attempt.
          • Sends an upload object's MD5 data hash to S3 in the header Content-MD5, to confirm no data corruption has taken place on the wire.
          James P. White added a comment -

          I like the idea of a general approach to retries, but using an existing AOP mechanism seems like a better way to go to me. Although AspectJ is a strong tool, Spring 2.0 AOP is closer to what you propose. In addition to Spring's autoproxies, Spring 2.0 fully supports AspectJ syntax including annotations. That provides the greatest flexibility and avoids the invention of yet-another-syntax for advice.

          http://static.springframework.org/spring/docs/2.0.x/reference/new-in-2.html#new-in-2-aop

          Doug Cutting added a comment -

          > To implement retries, I was imagining creating a wrapper implementation of FileSystemStore that only had retry functionality

          This reminds me of HADOOP-601. That's not directly applicable, since S3 doesn't use Hadoop's RPC. But it makes me wonder if perhaps we ought to pursue a generic retry mechanism. Perhaps we could have a RetryProxy that uses java.lang.reflect.Proxy to implement retries for all methods in an interface. One could perhaps specify per-method retry policies as a ctor parameter, e.g.:

          MethodRetry[] retries = new MethodRetry[] {
            new MethodRetry("myFirstMethod", ONCE),
            new MethodRetry("mySecondMethod", FOREVER)
          };
          MyInterface impl = new MyInterfaceImpl();
          MyInterface proxy = RetryProxy.create(MyInterface.class, impl, retries);

          or somesuch. Could this work?
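
          A minimal hedged sketch of the kind of reflective proxy described here, with a single fixed retry count instead of per-method policies (SimpleRetryProxy is illustrative, not an existing Hadoop class):

            import java.io.IOException;
            import java.lang.reflect.InvocationHandler;
            import java.lang.reflect.InvocationTargetException;
            import java.lang.reflect.Method;
            import java.lang.reflect.Proxy;

            public class SimpleRetryProxy {
              // Wraps 'impl' so every interface method is retried up to maxAttempts
              // times when it fails with an IOException; other exceptions propagate.
              @SuppressWarnings("unchecked")
              public static <T> T create(Class<T> iface, final T impl, final int maxAttempts) {
                return (T) Proxy.newProxyInstance(iface.getClassLoader(), new Class<?>[] { iface },
                    new InvocationHandler() {
                      public Object invoke(Object proxy, Method method, Object[] args) throws Throwable {
                        IOException last = null;
                        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
                          try {
                            return method.invoke(impl, args);
                          } catch (InvocationTargetException e) {
                            Throwable cause = e.getTargetException();
                            if (cause instanceof IOException) {
                              last = (IOException) cause;  // communication problem: retry
                            } else {
                              throw cause;                 // not retryable
                            }
                          }
                        }
                        throw last;  // exhausted the fixed number of attempts
                      }
                    });
              }
            }

          Per-method policies along the lines of the MethodRetry example could then be layered on by looking up a policy for method.getName() inside invoke().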

          Tom White added a comment -

          As you say, the immediate RequestTimeout after an InternalError looks like a jets3t issue. Probably worth looking at the jets3t source or sending a message to the jets3t list?

          To implement retries, I was imagining creating a wrapper implementation of FileSystemStore that only had retry functionality in it and delegated to another FileSystemStore for actual storage. This would make testing easier as well as isolating the retry code. I think the InputStreams in FileSystemStore would need to be changed to be Files, but this should be fine.

          Mike Smith added a comment -

          I've been working on a patch to handle these exceptions. There are three major exceptions that need to be retried.

          InternalError
          RequestTimeout
          OperationAborted

          The InternalError exception has a reasonably high rate for PUT requests! I have finished the patch for Jets3tFileSystem.java, which exponentially increases the waiting time. But I've been dealing with a very strange problem: when I get the InternalError (500) response and retry the request, I keep getting a RequestTimeout response from S3. This is a client exception, and it shows that jets3t closes the connection after the InternalError exception! Even when I try to re-establish the connection, I still get the same RequestTimeout response. Following is the changed put() method in Jets3tFileSystem.java; I have done similar changes for the other methods as well. Let me know if you see something wrong:

          private void put(String key, InputStream in, long length) throws IOException {
            int attempts = 0;
            while (true) {
              try {
                S3Object object = new S3Object(key);
                object.setDataInputStream(in);
                object.setContentType("binary/octet-stream");
                object.setContentLength(length);
                s3Service.putObject(bucket, object);
                break;
              } catch (S3ServiceException e) {
                if (!retry(e, ++attempts)) {
                  if (e.getCause() instanceof IOException) {
                    throw (IOException) e.getCause();
                  }
                  throw new S3Exception(e);
                }
              }
            }
          }

          private boolean retry(S3ServiceException e, int attempts) {

            if (attempts > maxRetry) return false;

            // for an internal exception (500), retry is allowed
            if (e.getErrorCode().equals(S3_INTERNAL_ERROR_CODE) ||
                e.getErrorCode().equals(S3_OPERATION_ABORTED_CODE)) {
              LOG.info("retrying failed s3Service [" + e.getErrorCode() + "]. Delay: " +
                  retryDelay * attempts + " msec. Attempts: " + attempts);
              try {
                Thread.sleep(retryDelay * attempts);
              } catch (Exception ee) {}
              return true;
            }

            // allows retry for the socket timeout exception.
            // the connection needs to be re-established.
            if (e.getErrorCode().equals(S3_REQUEST_TIMEOUT_CODE)) {
              try {
                AWSCredentials awsCredentials = new AWSCredentials(accessKey, secretAccessKey);
                this.s3Service = new RestS3Service(awsCredentials);
              } catch (S3ServiceException ee) {
                // this exception will be taken care of later
              }
              LOG.info("retrying failed s3Service [" + e.getErrorCode() + "]. Attempts: " + attempts);
              return true;
            }

            // for all other exceptions retrying is not allowed
            // Maybe it would be better to keep retrying for all sorts of exceptions!?
            return false;
          }

          Bryan Pendleton added a comment -

          Argh. Jira just ate my comment, this will be a terser version.

          Retry levels should be configurable, up to the point of infinite retry. Long-running stream operations are better off not dying, even if they have to wait while S3 fixes hardware shortages or failures.

          Not sure if it's a separate issue, but failed writes aren't cleaned up very well right now. In DFS, a file that isn't closed doesn't exist for other operations. If possible, there should at least be a way to find out whether a file in S3 is "done", and preferably it should be invisible to "normal" operations while its state is not final.


            People

            • Assignee: Tom White
            • Reporter: Tom White
            • Votes: 0
            • Watchers: 1
