Hadoop Common
HADOOP-574

want FileSystem implementation for Amazon S3

    Details

    • Type: New Feature
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 0.9.0
    • Fix Version/s: 0.10.0
    • Component/s: fs
    • Labels: None

      Description

      An S3-based Hadoop FileSystem would make a great addition to Hadoop.

      It would facilitate use of Hadoop on Amazon's EC2 computing grid, as discussed here:

      http://www.mail-archive.com/hadoop-user@lucene.apache.org/msg00318.html

      This is related to HADOOP-571, which would make Hadoop's FileSystem considerably easier to extend.

      1. dependencies.zip
        705 kB
        Tom White
      2. HADOOP-574.patch
        41 kB
        Tom White
      3. HADOOP-574-v2.patch
        48 kB
        Tom White
      4. HADOOP-574-v3.patch
        51 kB
        Tom White


          Activity

          Doug Cutting added a comment -

          Hadoop's wiki now includes a page describing how to use Hadoop on Amazon EC2:

          http://wiki.apache.org/lucene-hadoop/AmazonEC2

          Tom White added a comment -

          I've started work on this using a simple block design that borrows heavily from DFS. (I did a similar thing for the Java Map interface - although it was simpler - http://weblogs.java.net/blog/tomwhite/archive/2006/08/s3map.html) I should have something to show next week for discussion and review.

          Any suggestions on which package this should go in? org.hadoop.s3fs or org.hadoop.fs.s3, or something else?

          Doug Cutting added a comment -

          > I've started work

          Great! Jim Kellerman (jim at powerset dot com) has also privately expressed interest in implementing this. Perhaps you can collaborate?

          > which package should this go in? org.hadoop.s3fs or org.hadoop.fs.s3, or something else?

          Let's go with org.apache.hadoop.fs.s3. Someday we should probably move dfs into a subpackage of fs...

          Are you going to attack HADOOP-571 too, or is anyone else working on that? Otherwise it will be hard to configure a cluster whose default filesystem is S3, or to submit a MapReduce job whose input or output use a non-default filesystem.

          Doug Cutting added a comment -

          Here're some thoughts I sent Jim about this:

          DFS stores files as a sequence of ~100MB blocks. I think a scheme like this will be useful for an S3-based FileSystem too.

          When creating, each DFS block is first written locally to a temporary file, and, only when the block is full (or the file is closed) is the block actually written to DFS. This is instead of trying to trickle things to the network as they're written, which can run into timeout issues, etc. It also means that when a block write fails it can be easily retried.

          Very large files (up to a terabyte) should be supported. Breaking things into blocks should help here too. S3 limits an object value to 5GB. So each file can be represented as a set of ~100MB S3 object values. The set can be listed when the file is opened and used to guide seeks and reads of the data. The block number can be placed at the end of the name using a delimiter, so that access to metadata is not required when opening files or listing directories.
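
          The block layout described above might be sketched as follows. This is an illustration, not code from any patch: the delimiter character, class name, and method names are all assumptions.

          ```java
          import java.util.ArrayList;
          import java.util.List;

          public class BlockKeys {
              static final long BLOCK_SIZE = 100L * 1024 * 1024; // ~100MB blocks, per the comment
              static final String DELIM = "\u0001";              // assumed delimiter character

              /** S3 key for block n of the file at the given path. */
              static String blockKey(String path, int blockNumber) {
                  return path + DELIM + blockNumber;
              }

              /** Keys needed to store a file of the given length; a listing of
               *  these keys recovers the block set without a metadata read. */
              static List<String> keysFor(String path, long fileLength) {
                  List<String> keys = new ArrayList<>();
                  int blocks = (int) ((fileLength + BLOCK_SIZE - 1) / BLOCK_SIZE);
                  for (int i = 0; i < Math.max(blocks, 1); i++) {
                      keys.add(blockKey(path, i));
                  }
                  return keys;
              }
          }
          ```

          Since the block number is embedded in the key itself, opening a file only requires listing keys with the file path as a prefix.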

          Tom White added a comment -

          Thanks Doug. Collaboration sounds good: I'll contact Jim directly.

          Regarding HADOOP-571 I agree it makes sense to tackle this in conjunction. I'll have a look at it after we get the basics of the S3 filesystem working.

          As far as the design goes I agree that (like DFS) the S3 filesystem should divide things into blocks and buffer them to disk before writing them to S3. I'm not sure about putting the block number at the end of the filename (using a delimiter), since this makes renames very inefficient as S3 has no rename operation. Instead I have opted for a level of indirection whereby the S3 object at the filename is a metadata file which lists the block IDs that hold the data. A rename is then simply a re-PUT of the metadata. What do you think?
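
          The indirection scheme can be sketched like this; an in-memory map stands in for S3, and all names here are illustrative assumptions rather than code from the eventual patch.

          ```java
          import java.util.HashMap;
          import java.util.List;
          import java.util.Map;

          public class MetadataRename {
              // key -> value; a real store would issue S3 PUT/GET/DELETE requests
              static Map<String, Object> store = new HashMap<>();

              /** The object stored at the file's path is a small metadata
               *  record listing the block ids that hold the actual data. */
              static void createFile(String path, List<String> blockIds) {
                  store.put(path, blockIds);
              }

              /** Rename re-PUTs the metadata under the new key and deletes the
               *  old one; the (potentially huge) block objects are untouched. */
              static void rename(String src, String dst) {
                  store.put(dst, store.remove(src));
              }
          }
          ```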

          The other aspect I haven't put much thought into yet is locking. Keeping the number of HTTP requests to a minimum will be an interesting challenge.

          Doug Cutting added a comment -

          > The other aspect I haven't put much thought into yet is locking.

          I don't think locking is critical for the first version. MapReduce jobs shouldn't generally require locking. Longer term we should probably consider adding a locking service to Hadoop directly, rather than relying on one in the filesystem implementation.

          If no one else steps up to the plate in the next day or so then I'll start implementing HADOOP-571 (uri-based FileSystem Paths). I'd like to get that into this month's release if possible.

          I also added a related thread to Amazon's EC2 forum.

          http://developer.amazonwebservices.com/connect/thread.jspa?threadID=12677

          When you get something working, please attach it as a patch to this issue. Thanks!

          Tom White added a comment -

          Please find attached a patch for an S3 filesystem for review and discussion. It is not yet complete so I am not expecting it to be committed.
          There are a number of TODO comments in the code that list the remaining work items. These mainly concern corner cases or tidying up work - all the basic file system operations work.

          The code has been exercised by unit tests. To get these to run you need to fill in S3 access codes in the test hadoop-site.xml. (This raises an interesting question of how to have these run as automated tests.)

          I have not tried running Hadoop (or something like DfsShell) with this filesystem. We probably need to wait until HADOOP-571 is done.

          Looking forward to your feedback!

          Tom White added a comment -

          I forgot to mention that the code relies on the (Apache licensed) jets3t S3 toolkit (https://jets3t.dev.java.net/), which in turn requires commons-codec-1.3.jar and commons-httpclient-3.0.1.jar. (I wasn't sure whether I should attach these dependencies or not.)

          Doug Cutting added a comment -

          Yes, please also attach the required jar files: that will save me having to find them!

          At a glance, the code looks great! It's unfortunate that one needs an account to run the unit tests. That means we can't require the tests in the mainline test task, since we don't want to require every developer to have an S3 account. However we can still run them as a part of the nightly task, using, e.g., my S3 account. Or maybe we could have these tests just print a warning when the account properties are undefined, but still succeed, so the default build is not broken?

          Now I need to get cracking on HADOOP-571...

          Tom White added a comment -

          Attached is a zip with the required dependencies.

          Regarding the unit testing, it is a pity that we require an account (and it costs money every time you run them!). However, I plan to write a stub implementation of FileSystemStore that can go in the mainline tests. We will still need a nightly test against the real S3 though. Or perhaps we could use ParkPlace (an S3 clone written in Ruby), although I don't know how faithful it is to Amazon...

          Doug Cutting added a comment -

          That sounds like a good plan: by default unit tests should not attempt to connect to S3. But it should be easy to configure things so that S3 is used, and we can do that in the nightlies (provided my wife approves the expense!).

          Tom White added a comment -

          I've been looking at Jim Kellerman's suggestion of using a URL-safe Base64 encoding for path names since they are more compact than regular URL encoding. (An S3 key is a maximum of 1024 bytes.) See http://www.faqs.org/rfcs/rfc3548.html and http://www.faqs.org/qa/rfcc-1940.html for details. Jim has implemented these algorithms in a public domain Base64 encoding package
          (http://iharder.sourceforge.net/current/java/base64/, version 2.2).

          However, I don't think Base64 encoding is compatible with the delimiter request parameter for S3 bucket listing, since Base64 encoding doesn't preserve byte boundaries, so it is not possible to search for a substring in the Base64-encoded representation of some text. The current S3 FileSystem code uses the delimiter request parameter as an efficient way to implement the listPathsRaw method. Therefore I think it is probably best to stick with the current URL encoding solution.
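
          The byte-boundary problem can be demonstrated with Java's standard encoders (modern java.util.Base64 here stands in for the iharder package; the example paths are made up):

          ```java
          import java.net.URLEncoder;
          import java.nio.charset.StandardCharsets;
          import java.util.Base64;

          public class EncodingPrefix {
              // Base64 encodes bytes in groups of three, so the encoding of
              // "/dir/" is in general not a character prefix of the encoding
              // of "/dir/file" - prefix-based listing breaks.
              static boolean base64PrefixSurvives(String prefix, String full) {
                  Base64.Encoder enc = Base64.getUrlEncoder().withoutPadding();
                  String p = enc.encodeToString(prefix.getBytes(StandardCharsets.UTF_8));
                  String f = enc.encodeToString(full.getBytes(StandardCharsets.UTF_8));
                  return f.startsWith(p);
              }

              // URL encoding maps each byte independently, so prefixes survive.
              static boolean urlPrefixSurvives(String prefix, String full) throws Exception {
                  return URLEncoder.encode(full, "UTF-8")
                          .startsWith(URLEncoder.encode(prefix, "UTF-8"));
              }
          }
          ```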

          Tom White added a comment -

          Please find attached an updated patch that includes a stub implementation of FileSystemStore for the unit tests. The nightly tests should be the only ones that run against S3 (as discussed), but I haven't changed any config to actually make this happen.

          The patch also includes updates for many of the remaining TODOs, which were mainly testing and implementing edge cases.

          Tom White added a comment -

          As an experiment I modified FileSystem.getNamed() to include a check for s3 (note that this change is not in HADOOP-574-v2.patch):

          if ("local".equals(name)) {
            fs = new LocalFileSystem(conf);
          } else if ("s3".equals(name)) {
            fs = new S3FileSystem(conf);
          } else {
            fs = new DistributedFileSystem(DataNode.createSocketAddr(name), conf);
          }

          I then built a hadoop distribution, unpacked it, created a hadoop-site.xml with my S3 keys and bucket name, and ran the following in the bin directory:

          ./hadoop dfs -conf hadoop-site.xml -fs s3 -mkdir /tmp/tom
          ./hadoop dfs -conf hadoop-site.xml -fs s3 -copyFromLocal s3test.txt /tmp/tom/s3test.txt
          ./hadoop dfs -conf hadoop-site.xml -fs s3 -copyToLocal /tmp/tom/s3test.txt s3test.copy.txt
          diff s3test.copy.txt s3test.txt
          ./hadoop dfs -conf hadoop-site.xml -fs s3 -rm /tmp/tom/s3test.txt
          ./hadoop dfs -conf hadoop-site.xml -fs s3 -rmr /tmp/tom

          All the commands succeeded and the diff showed that the files were the same. This was a great sanity check!

          This suggests to me that DFSShell is really more general than DFS - perhaps it should be renamed FSShell?

          Next steps - I guess I need to see how the S3 implementation fits with HADOOP-571. Suggestions welcome.

          Doug Cutting added a comment -

          > DFSShell is really more general than DFS - perhaps it should be renamed FSShell?

          Yes, it probably should, since it is largely generic.

          > I need to see how the S3 implementation fits with HADOOP-571. Suggestions welcome.

          What should the URI look like? The scheme should be "s3", that's easy, but what about the authority? I think it makes sense to map bucket to host, access key id to user and secret access key to password. So that would give URIs like s3://id:secret@bucket/. One's hadoop-site.xml could then just specify the fs.default.name using this syntax.
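
          The proposed mapping can be exercised with java.net.URI; the id, secret, and bucket values below are placeholders.

          ```java
          import java.net.URI;

          public class S3UriParse {
              // Proposed mapping: bucket -> host, access key id -> user,
              // secret access key -> password, i.e. s3://id:secret@bucket/
              static String bucket(URI uri)    { return uri.getHost(); }
              static String accessKey(URI uri) { return uri.getUserInfo().split(":", 2)[0]; }
              static String secretKey(URI uri) { return uri.getUserInfo().split(":", 2)[1]; }

              public static void main(String[] args) {
                  URI uri = URI.create("s3://MYID:MYSECRET@mybucket/dir/file");
                  System.out.println(bucket(uri) + " " + accessKey(uri));
              }
          }
          ```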

          Other than that, all you should need to do is define a few new FileSystem methods (getUri(), initialize(URI, Configuration)) and a new configuration property (fs.s3.impl).

          stack added a comment -

          +1 on scheme.

          Regarding mixing in the access id and access secret, users will need to be careful with job input files that list S3 files by URI. On bucket as host, it's probably safe to say the S3 hostname, s3.amazonaws.com, isn't likely to change any time soon. And if ever there were an S3 filesystem that went over SSL, its scheme might be s3s?

          Tom White added a comment -

          I like the proposal and started to implement it, but then I discovered that my secret key contains '/' characters (two in fact), which makes it awkward to turn into a URI, since you have to manually escape them.

          While this is doable, it seems to me that it makes it harder to configure, and could be a source of interesting support questions. I haven't really got a better alternative though.
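
          A sketch of the escaping in question, using URLEncoder (a form-data encoder, but sufficient here since AWS keys are ASCII without spaces); the secret value is made up:

          ```java
          import java.net.URI;
          import java.net.URLEncoder;

          public class SecretEscape {
              /** Build an s3:// URI from credentials. Left unescaped, the '/'
               *  characters in the secret would terminate the URI's authority
               *  component, so the secret is percent-escaped first. */
              static URI s3Uri(String id, String secret, String bucket) throws Exception {
                  String escaped = URLEncoder.encode(secret, "UTF-8"); // '/' -> %2F
                  return URI.create("s3://" + id + ":" + escaped + "@" + bucket + "/");
              }
          }
          ```

          Note that java.net.URI decodes the escapes again on access, so getUserInfo() returns the original secret.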

          James P. White added a comment -

          What about putting the secret in a keystore, or at least some configuration file?

          That also opens up the possibility of supporting a default id, which would enable folks to set up shared/reusable jobs.

          Tom White added a comment -

          Thinking about this more perhaps the best thing would be to put the access key and secret access key into the configuration file (as implemented in the first two patches, the properties are fs.s3.awsAccessKeyId and fs.s3.awsSecretAccessKey), but keep the bucket name as a part of the URI: s3://bucket/.
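
          Under this proposal a hadoop-site.xml might look like the following sketch (the property names are those named in the comment above; the values are placeholders):

          ```xml
          <configuration>
            <property>
              <name>fs.default.name</name>
              <value>s3://mybucket/</value>
            </property>
            <property>
              <name>fs.s3.awsAccessKeyId</name>
              <value>YOUR_ACCESS_KEY_ID</value>
            </property>
            <property>
              <name>fs.s3.awsSecretAccessKey</name>
              <value>YOUR_SECRET_ACCESS_KEY</value>
            </property>
          </configuration>
          ```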

          As an enhancement we might allow the keys to be specified in the URI (per Doug's original suggestion) if people wanted to put them there (escaped). These would take precedence over keys specified in the config file.

          James P. White added a comment -

          I was thinking that you would retain the parsing of the id and secret from the URI, but that the secret would be found by id in the keystore/configuration file if omitted.

          The same approach would apply to the id if omitted, as the configuration file could supply a default id (fs.s3.awsAccessKeyId), which would then lead to the secret.

          One way to do that would be to use the configuration file and put the id in the property name:

          fs.s3.awsSecretAccessKey.<myid>
          fs.s3.awsSecretAccessKey.<myfriendsid>

          That ultimately would be compatible with the old scheme if fs.s3.awsSecretAccessKey were the default secret when not otherwise specified (effectively fs.s3.awsSecretAccessKey.*).

          The only reservation I have with putting the secret in the configuration file is I don't like using plain text for secrets. An encrypted store would be nicer. Supporting OpenSSH Agents would be very nice too.

          Tom White added a comment -

          I've now got a version of this that works with HADOOP-571 patch 5. You can specify access key id and secret access key in the URI (Doug's suggestion), or you can specify them in hadoop-site.xml. I haven't implemented support for multiple keys or encrypted keys (James' suggestion), but don't see why this couldn't be added in the future.

          I'll create a patch after HADOOP-571 is committed.

          Tom White added a comment -

          Here is a version that works with the changes made in HADOOP-571 that are now in trunk.

          The unit test TestJets3tS3FileSystem runs against S3 and requires an S3 account - it should be excluded from the mainline tests and run nightly perhaps (I have not made the changes to the build file to do this). TestInMemoryS3FileSystem should stay in the mainline tests.

          There is some further work to make the implementation more robust (by implementing retries), but I believe the code is ready to be committed.

          Doug Cutting added a comment -

          I just committed this. Thanks, Tom, it looks great!


            People

            • Assignee: Unassigned
            • Reporter: Doug Cutting
            • Votes: 4
            • Watchers: 2