Hadoop Common / HADOOP-10400

Incorporate new S3A FileSystem implementation

    Details

    • Type: New Feature
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 2.4.0
    • Fix Version/s: 2.6.0
    • Component/s: fs, fs/s3
    • Labels:
      None
    • Target Version/s:
    • Hadoop Flags:
      Reviewed

      Description

      The s3native filesystem has a number of limitations (some of which were recently fixed by HADOOP-9454). This patch adds an s3a filesystem which uses the aws-sdk instead of the jets3t library. There are a number of improvements over s3native including:

      • Parallel copy (rename) support (dramatically speeds up commits on large files)
      • AWS S3 console-compatible empty directory markers ("xyz/" instead of "xyz_$folder$"), which reduces clutter
      • Ignores _$folder$ files created by s3native and other S3 browsing utilities
      • Supports multiple output buffer dirs to even out IO when uploading files
      • Supports IAM role-based authentication
      • Allows setting a default canned ACL for uploads (public, private, etc.)
      • Better error recovery handling
      • Should handle input seeks without having to download the whole file (commonly needed when reading splits)

      This code is a copy of https://github.com/Aloisius/hadoop-s3a with patches to various pom files to get it to build against trunk. I've been using 0.0.1 in production with CDH 4 for several months and CDH 5 for a few days. The version here is 0.0.2, which changes some key names to bring them more in line with the rest of Hadoop 2.x.
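
      As a rough usage sketch (not part of the patch): client code reaches the new filesystem through the normal FileSystem API. The bucket name and path below are placeholders, and fs.s3a.impl is assumed to be mapped to the S3AFileSystem class added by this patch.

          import java.net.URI;
          import org.apache.hadoop.conf.Configuration;
          import org.apache.hadoop.fs.FileStatus;
          import org.apache.hadoop.fs.FileSystem;
          import org.apache.hadoop.fs.Path;

          public class S3AListExample {
            public static void main(String[] args) throws Exception {
              Configuration conf = new Configuration();
              // Map the s3a:// scheme onto the new filesystem class from this patch.
              conf.set("fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem");

              // "my-bucket" is a placeholder; credentials come from the fs.s3a.* keys
              // listed below, or from an IAM role if they are omitted.
              FileSystem fs = FileSystem.get(new URI("s3a://my-bucket/"), conf);
              for (FileStatus status : fs.listStatus(new Path("/logs"))) {
                System.out.println(status.getPath() + " " + status.getLen());
              }
            }
          }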

      Tunable parameters:

      fs.s3a.access.key - Your AWS access key ID (omit for role authentication)
      fs.s3a.secret.key - Your AWS secret key (omit for role authentication)
      fs.s3a.connection.maximum - Controls how many parallel connections HttpClient spawns (default: 15)
      fs.s3a.connection.ssl.enabled - Enables or disables SSL connections to S3 (default: true)
      fs.s3a.attempts.maximum - How many times we should retry commands on transient errors (default: 10)
      fs.s3a.connection.timeout - Socket connect timeout (default: 5000)
      fs.s3a.paging.maximum - How many keys to request at a time from S3 when doing directory listings (default: 5000)
      fs.s3a.multipart.size - How big (in bytes) to split an upload or copy operation up into (default: 104857600)
      fs.s3a.multipart.threshold - Until a file is this large (in bytes), use non-parallel upload (default: 2147483647)
      fs.s3a.acl.default - Set a canned ACL on newly created/copied objects (private | public-read | public-read-write | authenticated-read | log-delivery-write | bucket-owner-read | bucket-owner-full-control)
      fs.s3a.multipart.purge - True if you want to purge existing multipart uploads that may not have been completed/aborted correctly (default: false)
      fs.s3a.multipart.purge.age - Minimum age in seconds of multipart uploads to purge (default: 86400)
      fs.s3a.buffer.dir - Comma separated list of directories that will be used to buffer file writes out of (default: ${hadoop.tmp.dir}/s3a)
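
      As an illustration only (not part of the patch), the same keys can be set programmatically on a Hadoop Configuration instead of in core-site.xml; all of the values below are example placeholders:

          import org.apache.hadoop.conf.Configuration;

          public class S3AConfigExample {
            static Configuration s3aConf() {
              Configuration conf = new Configuration();
              // Omit the key pair entirely to fall back to IAM role authentication.
              conf.set("fs.s3a.access.key", "AKIA...");           // placeholder
              conf.set("fs.s3a.secret.key", "...");               // placeholder
              conf.setBoolean("fs.s3a.connection.ssl.enabled", true);
              conf.setInt("fs.s3a.connection.maximum", 15);
              conf.setLong("fs.s3a.multipart.size", 104857600L);  // 100 MB parts
              conf.set("fs.s3a.buffer.dir", "/data/1/s3a,/data/2/s3a");
              return conf;
            }
          }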

      Caveats:

      Hadoop uses a standard output committer which uploads files as filename.COPYING before renaming them. This can cause unnecessary performance issues with S3 because it does not have a rename operation, and S3 already verifies uploads against an MD5 that the driver sets on the upload request. While this FileSystem should be significantly faster than the built-in s3native driver because of parallel copy support, you may want to consider setting a null output committer on your jobs to further improve performance.
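
      No such committer ships with this patch; as a hedged sketch only, a "null" committer could be a no-op subclass of org.apache.hadoop.mapreduce.OutputCommitter returned from a custom OutputFormat's getOutputCommitter(). Whether skipping the commit step is safe depends on your job; the class name below is hypothetical.

          import java.io.IOException;
          import org.apache.hadoop.mapreduce.JobContext;
          import org.apache.hadoop.mapreduce.OutputCommitter;
          import org.apache.hadoop.mapreduce.TaskAttemptContext;

          // A no-op committer: tasks write directly to their final destination,
          // so there is nothing to set up, commit, or abort.
          public class DirectOutputCommitter extends OutputCommitter {
            @Override public void setupJob(JobContext jobContext) throws IOException { }
            @Override public void setupTask(TaskAttemptContext taskContext) throws IOException { }
            @Override public boolean needsTaskCommit(TaskAttemptContext taskContext) throws IOException {
              return false;  // no per-task commit step needed
            }
            @Override public void commitTask(TaskAttemptContext taskContext) throws IOException { }
            @Override public void abortTask(TaskAttemptContext taskContext) throws IOException { }
          }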

      Because S3 requires the file length and MD5 to be known before a file is uploaded, all output is buffered out to a temporary file first similar to the s3native driver.

      Due to the lack of a native rename() for S3, renaming extremely large files or directories may take a while. Unfortunately, there is no way to notify Hadoop that progress is still being made during rename operations, so your job may time out unless you increase the task timeout.
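
      For example (illustrative value only; mapreduce.task.timeout is the usual Hadoop 2.x property name, mapred.task.timeout the older one):

          import org.apache.hadoop.conf.Configuration;

          public class TaskTimeoutExample {
            static void raiseTaskTimeout(Configuration conf) {
              // Allow up to 30 minutes without progress before a task is declared dead.
              conf.setLong("mapreduce.task.timeout", 30 * 60 * 1000L);
            }
          }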

      This driver will fully ignore _$folder$ files. This was necessary so that it could interoperate with repositories that have had the s3native driver used on them, but means that it won't recognize empty directories that s3native has been used on.

      Statistics for the filesystem may be calculated differently than the s3native filesystem. When uploading a file, we do not count writing the temporary file on the local filesystem towards the local filesystem's written bytes count. When renaming files, we do not count the S3->S3 copy as read or write operations. Unlike the s3native driver, we only count bytes written when we start the upload (as opposed to the write calls to the temporary local file). The driver also counts read & write ops, but they are done mostly to keep from timing out on large s3 operations.

      The AWS SDK unfortunately passes the multipart threshold as an int which means
      fs.s3a.multipart.threshold can not be greater than 2^31-1 (2147483647).

      This is currently implemented as a FileSystem and not an AbstractFileSystem.

      1. HADOOP-10400-1.patch
        74 kB
        Jordan Mendelson
      2. HADOOP-10400-2.patch
        74 kB
        Jordan Mendelson
      3. HADOOP-10400-3.patch
        75 kB
        Jordan Mendelson
      4. HADOOP-10400-4.patch
        75 kB
        Jordan Mendelson
      5. HADOOP-10400-5.patch
        75 kB
        Jordan Mendelson
      6. HADOOP-10400-6.patch
        75 kB
        Matteo Bertozzi
      7. HADOOP-10400-7.patch
        68 kB
        David S. Wang
      8. HADOOP-10400-8.patch
        97 kB
        David S. Wang
      9. HADOOP-10400-8-branch-2.patch
        96 kB
        David S. Wang
      10. HADOOP-10400-branch-2.patch
        67 kB
        David S. Wang

        Issue Links

          Activity

          michaelthoward Michael Howard added a comment -

          Thank you for this!
          Only a few weeks ago I asked AWS support about S3 support that used the AWS SDK for Java rather than jets3t.

          aloisius Jordan Mendelson added a comment -

          Updates core-default.xml to properly match the defaults in the s3a driver.

          hadoopqa Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12633839/HADOOP-10400-1.patch
          against trunk revision .

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 2 new or modified test files.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          -1 javadoc. The javadoc tool appears to have generated 2 warning messages.
          See https://builds.apache.org/job/PreCommit-HADOOP-Build/3657//artifact/trunk/patchprocess/diffJavadocWarnings.txt for details.

          +1 eclipse:eclipse. The patch built with eclipse:eclipse.

          -1 findbugs. The patch appears to introduce 4 new Findbugs (version 1.3.9) warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          +1 core tests. The patch passed unit tests in hadoop-client hadoop-common-project/hadoop-common hadoop-hdfs-project/hadoop-hdfs-httpfs.

          +1 contrib tests. The patch passed contrib unit tests.

          Test results: https://builds.apache.org/job/PreCommit-HADOOP-Build/3657//testReport/
          Findbugs warnings: https://builds.apache.org/job/PreCommit-HADOOP-Build/3657//artifact/trunk/patchprocess/newPatchFindbugsWarningshadoop-common.html
          Console output: https://builds.apache.org/job/PreCommit-HADOOP-Build/3657//console

          This message is automatically generated.

          aloisius Jordan Mendelson added a comment -

          HADOOP-10400-3 should take care of the linter problems.

          hadoopqa Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12633849/HADOOP-10400-3.patch
          against trunk revision .

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 2 new or modified test files.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 javadoc. There were no new javadoc warning messages.

          +1 eclipse:eclipse. The patch built with eclipse:eclipse.

          -1 findbugs. The patch appears to introduce 2 new Findbugs (version 1.3.9) warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          +1 core tests. The patch passed unit tests in hadoop-client hadoop-common-project/hadoop-common hadoop-hdfs-project/hadoop-hdfs-httpfs.

          +1 contrib tests. The patch passed contrib unit tests.

          Test results: https://builds.apache.org/job/PreCommit-HADOOP-Build/3658//testReport/
          Findbugs warnings: https://builds.apache.org/job/PreCommit-HADOOP-Build/3658//artifact/trunk/patchprocess/newPatchFindbugsWarningshadoop-common.html
          Console output: https://builds.apache.org/job/PreCommit-HADOOP-Build/3658//console

          This message is automatically generated.

          hadoopqa Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12633865/HADOOP-10400-4.patch
          against trunk revision .

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 2 new or modified test files.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 javadoc. There were no new javadoc warning messages.

          +1 eclipse:eclipse. The patch built with eclipse:eclipse.

          +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          -1 core tests. The patch failed these unit tests in hadoop-client hadoop-common-project/hadoop-common hadoop-hdfs-project/hadoop-hdfs-httpfs:

          org.apache.hadoop.metrics2.impl.TestMetricsSystemImpl

          +1 contrib tests. The patch passed contrib unit tests.

          Test results: https://builds.apache.org/job/PreCommit-HADOOP-Build/3660//testReport/
          Console output: https://builds.apache.org/job/PreCommit-HADOOP-Build/3660//console

          This message is automatically generated.

          stevel@apache.org Steve Loughran added a comment -

          Marking as depending on HADOOP-9565 Blobstore interface, HADOOP-10373 -hadoop-aws.jar, and the extra contract tests of HADOOP-9361.

          Before adding more AWS/blobstore logic, we need to get the contract/tests tightened and move it out of core for easy swapping in & out

          stevel@apache.org Steve Loughran added a comment -
          1. I've filed some dependencies; the '9361 is on my plate but I only have a very limited amount of time; I do need to finish it and then the rest can come quickly.
          2. As well as the new tests, it'll need to extend {{FileSystemContractBaseTest}}.
          3. I'd like to avoid having >1 S3 "native" client; better to do a hard switch than have a parallel codebase to maintain. Can this replace S3N today?
          atm Aaron T. Myers added a comment -

          Hi Steve,

          Marking as depending on HADOOP-9565 Blobstore interface, HADOOP-10373 -hadoop-aws.jar, and the extra contract tests of HADOOP-9361.

          Before adding more AWS/blobstore logic, we need to get the contract/tests tightened and move it out of core for easy swapping in & out
          ...
          I've filed some dependencies; the '9361 is on my plate but only a very limited amount of time; I do need to finish it and the rest can come quickly.

          While it would certainly be nice to have these done before committing S3A, I don't think it's reasonable to make these firm dependencies, mostly because they're not actually strictly required for the functionality, and also because we don't know when these other things will get done. I think the improved S3 integration provided by S3A is important enough that we shouldn't hold it up for these.

          as well as the new tests, it'll need to extend {{FileSystemContractBaseTest}}.

          Yes, I agree with this. Jordan, can you take care of hooking S3A up to the FileSystemContractBaseTest?

          I'd like to avoid having >1 S3 "native" client, better to do a hard switch than have a parallel codebase to maintain. Can this replace S3N today?

          I disagree. I think we should deprecate S3N and introduce S3A side-by-side. Since it'll be deprecated, presumably folks will slow or stop maintenance of S3N, so there doesn't seem to be much benefit to forcing a hard switch, which has the potential to be disruptive.

          aloisius Jordan Mendelson added a comment -

          If we're asking for a wishlist of things to implement:

          1. It would be very nice if the FileSystem interface were extended to allow passing a recursive flag to listStatus requests, with some kind of iterable response interface (a rough sketch of a prefix listing with the AWS SDK follows this list).
            • S3 lets us fetch all the keys under a given prefix with a single API call, which is dramatically faster than requesting them one at a time. This could be used for things like the fs du -s command and common operations such as addInputPath, making them far faster, and would also dramatically reduce the number of API calls (which Amazon charges for).
          2. It would be nice if there were some way of informing the default committer that it is not necessary for a given FileSystem.
            • Since S3 uploads are verified by MD5 when they are uploaded, file uploads that fail halfway through won't show up on S3, so there is no risk of corruption.
          3. A standard method that allows a direct FileSystem-to-FileSystem copy, like copyFile(Path, Path), would be lovely in the FileSystem interface.
            • S3 allows an S3->S3 copy. It actually allows more complicated scenarios such as a multiple-file-on-S3->S3 copy.
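
          For reference, a rough sketch of a single prefix listing with the AWS Java SDK (the bucket and prefix are placeholders, and error handling is omitted):

              import com.amazonaws.services.s3.AmazonS3Client;
              import com.amazonaws.services.s3.model.ListObjectsRequest;
              import com.amazonaws.services.s3.model.ObjectListing;
              import com.amazonaws.services.s3.model.S3ObjectSummary;

              public class PrefixListingExample {
                public static void main(String[] args) {
                  AmazonS3Client s3 = new AmazonS3Client();  // credentials from the default provider chain
                  ListObjectsRequest request = new ListObjectsRequest()
                      .withBucketName("my-bucket")           // placeholder bucket
                      .withPrefix("logs/2014/");             // placeholder prefix
                  ObjectListing listing = s3.listObjects(request);
                  while (true) {
                    for (S3ObjectSummary summary : listing.getObjectSummaries()) {
                      System.out.println(summary.getKey() + " " + summary.getSize());
                    }
                    if (!listing.isTruncated()) {
                      break;
                    }
                    listing = s3.listNextBatchOfObjects(listing);  // paged, up to 1000 keys per call
                  }
                }
              }
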
          aloisius Jordan Mendelson added a comment -

          This new version (-5) adjusts the test to use FileSystemContractBaseTest.

          However, because FileSystemContractBaseTest uses the JUnit 3 runner, we can't use the assume*() functions to skip the tests if we don't have a valid URL to test against, so the test has been renamed from TestS3AFileSystem to S3AFileSystemContractBaseTest and is consequently no longer run by default. A better solution would be to update FileSystemContractBaseTest for JUnit 4, but there are a lot of dependencies and that seems outside the scope of this patch.
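
          For context, this is the kind of guard a JUnit 4 runner would allow (a hypothetical test class, not part of this patch; the "test.fs.s3a.name" property name is also only an example):

              import static org.junit.Assume.assumeNotNull;

              import org.apache.hadoop.conf.Configuration;
              import org.junit.Before;
              import org.junit.Test;

              public class TestS3AContract {
                private Configuration conf;

                @Before
                public void setUp() {
                  conf = new Configuration();
                  // Skip (rather than fail) every test when no test bucket is configured.
                  assumeNotNull(conf.get("test.fs.s3a.name"));
                }

                @Test
                public void testSomething() {
                  // ... contract assertions would go here ...
                }
              }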

          I'm not entirely sure why the linter is complaining about these patches. The last one complained about code completely outside of mine. Odd.

          hadoopqa Hadoop QA added a comment -

          +1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12635689/HADOOP-10400-5.patch
          against trunk revision .

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 3 new or modified test files.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 javadoc. There were no new javadoc warning messages.

          +1 eclipse:eclipse. The patch built with eclipse:eclipse.

          +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          +1 core tests. The patch passed unit tests in hadoop-client hadoop-common-project/hadoop-common hadoop-hdfs-project/hadoop-hdfs-httpfs.

          +1 contrib tests. The patch passed contrib unit tests.

          Test results: https://builds.apache.org/job/PreCommit-HADOOP-Build/3684//testReport/
          Console output: https://builds.apache.org/job/PreCommit-HADOOP-Build/3684//console

          This message is automatically generated.

          aloisius Jordan Mendelson added a comment -

          Steve Loughran

          I've filed some dependencies; the '9361 is on my plate but only a very limited amount of time; I do need to finish it and the rest can come quickly.

          Honestly, in lieu of documentation or contract tests, I just went into other filesystem code, saw what the most common error types/checks/etc. were and duplicated them, but there is a lot of ambiguity about the right way of doing things. Heck, the fact that we strip off the trailing / from paths before they even make it to the filesystem layer makes just following POSIX move rules problematic.

          as well as the new tests, it'll need to extend {{FileSystemContractBaseTest}}.

          Should be done with the latest patch, though honestly I didn't let the tests run very long.

          I'd like to avoid having >1 S3 "native" client, better to do a hard switch than have a parallel codebase to maintain. Can this replace S3N today?

          It could replace it and I've been running it in production for months, but there is one possible incompatibility that might cause people some problems - the handling of the directory marker files. In s3n they were named directory_$folder$. In s3a, they are directory/. The reason for the switch is that every other tool, including Amazon's own web interface, uses the latter. I'm not entirely certain what kind of problems you might see by just switching people from s3n to s3a on buckets they've previously used s3n on. S3A will actually ignore those files altogether so they don't get added to things like inputPaths by mistake, but I can imagine some edge conditions where it wouldn't be a perfectly transparent switch.

          stevel@apache.org Steve Loughran added a comment -

          there is a lot of ambiguity about the right way of doing things

          yes there is ...

          stevel@apache.org Steve Loughran added a comment -

          Aaron T. Myers
          OK, we can run these side by side. After all, the user can flip a switch in hadoop-config.xml to change over, and they could migrate from one layout to another with distcp s3n://src s3a://dest.

          At the very least, I'd like HADOOP-10373 - a new tools/hadoop-aws JAR with the filesystem and the new dependencies. This:

          1. keeps the existing hadoop client dependencies down
          2. sets things up for migration of s3 and s3n in future
          3. lets the build skip the entire test suite (including the extension of the JUnit 3 test class) if the relevant auth keys aren't present. This is what the hadoop-openstack POM does https://github.com/apache/hadoop-common/blob/trunk/hadoop-tools/hadoop-openstack/pom.xml#L40 - with a file, auth-keys.xml, that is set up to be .gitignored, preventing anyone from accidentally checking in their keys.
          savu.andrei Andrei Savu added a comment -

          The patch looks good to me - I've done some minimal testing.

          Here are some of the things we should improve before committing:

          • S3AFileSystem should incorporate HADOOP-10511 and expose S3 server-side encryption as a configuration flag (similar to HADOOP-10568): http://docs.aws.amazon.com/AmazonS3/latest/dev/SSEUsingJavaSDK.html
          • Improve the delete logic to work with 1000+ keys. This is not documented in the Java SDK, but you can only delete 1000 keys per REST API call: http://docs.aws.amazon.com/AmazonS3/latest/API/multiobjectdeleteapi.html

          I am also +1 on Steve's suggestion of adding tools/hadoop-aws.

          stevel@apache.org Steve Loughran added a comment -

          Andrei points out something else: more scale tests. There's something in swiftfs that does many-file operations, which picked up throttling problems on some services

          amansk Amandeep Khurana added a comment -

          Jordan Mendelson - Why do you have the old and new properties in the constants file? Why not just a single set of properties? Also, if someone is using S3N and has set the credentials for that, should this just pick them up or do you want to have users explicitly specify credentials for S3A?

          Otherwise, +1 to the following suggestions made earlier:
          1. Expose SSE as a config
          2. Adding tools/hadoop-aws

          This patch is good to go IMO and doesn't need to block on any of the above. All the suggestions can be put in as incremental add-ons in subsequent patches. However, it'll be nice to have them all committed before the next release.

          stevel@apache.org Steve Loughran added a comment -

          Amandeep, the patch is not good to go until it works, which is what the extra HADOOP-9361 tests will do - testing a lot more than what is in the existing FS contract. I am not confident that this is the case, because HADOOP-10533 shows that the last update to s3n caused a lot of regressions. We need the extra tests so that we can be confident that the code works as expected.

          I think '9361 is nearly ready to go in -if we can get the core spec and abstract tests in, then s3a can pick them up quickly, and there's less to worry about in terms of backwards compatibility in any changes.

          One thing we could do, quickly, is create a stub hadoop-tools/hadoop-aws module that has nothing but the code structure and the maven dependency. This patch could then use that as the basis for code rather than build changes. I can help do that, with a build that doesn't run tests until a test configuration resource file is present.

          ksumit Sumit Kumar added a comment -

          A few observations:

          1. Should it include tests to verify behavior when a user tries to write paths that have multiple "/", such as s3a://foo////bar/delta///gammma//abc? How this implementation handles that would be interesting, because each occurrence of "/" would appear to be a directory in the current implementation.
          2. Should it have more logic to consider _$folder$ as a folder marker as well, like it does for "/" (by creating fake directories)? That way the implementation would behave exactly the same as the current s3n. If item #1 fails, I don't see another approach to solve folder representation in S3.
          3. aws-java-sdk provides https://github.com/aws/aws-sdk-java/blob/master/src/main/java/com/amazonaws/auth/DefaultAWSCredentialsProviderChain.java Should we consider adding ProfileCredentialsProvider as described here https://java.awsblog.com/post/TxRE9V31UFN860/Secure-Local-Development-with-the-ProfileCredentialsProvider ? This might be a big boon in testing S3 behaviors as unit tests (it has always been really hard to keep access and secret keys out of code/XML checked into code bases). A rough sketch follows this list.
          4. Should S3AFileStatus be more strict about constructor arguments? For example, if it's a directory constructor, do we need the isdir flag? Should this be a clearer API?
          5. Should it be doing parallel rename/delete operations as well? More specifically, could copy operations (while renaming a folder) leverage parallel threads using the TransferManager APIs?
          6. Should it implement the iterative listing API as well, for better performance, and build listStatus on top of it?
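
          For illustration, the providers mentioned in item 3 can be exercised roughly like this (the "dev" profile name is only an example):

              import com.amazonaws.auth.AWSCredentials;
              import com.amazonaws.auth.DefaultAWSCredentialsProviderChain;
              import com.amazonaws.auth.profile.ProfileCredentialsProvider;
              import com.amazonaws.services.s3.AmazonS3Client;

              public class CredentialsExample {
                public static void main(String[] args) {
                  // Resolves credentials from the SDK's standard locations
                  // (environment variables, system properties, instance-profile credentials, ...).
                  AWSCredentials chained = new DefaultAWSCredentialsProviderChain().getCredentials();

                  // Or read a named profile from ~/.aws/credentials for local development.
                  AmazonS3Client s3 = new AmazonS3Client(new ProfileCredentialsProvider("dev"));
                  System.out.println("Using access key: " + chained.getAWSAccessKeyId());
                }
              }
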
          stevel@apache.org Steve Loughran added a comment -

          I've just added a patch for HADOOP-10373 which creates a new module for hadoop-amazon support

          1. all the new filesystem code must go in there, to keep it and its dependencies isolated
          2. you can just incorporate that patch and move your code over, to have one unified patch (i.e. no need to depend on 10373 being checked in)
          3. core-site.xml can still retain the settings for the new FS
          4. we'll need documentation in hadoop-amazon/src/main/site/markdown (or site/apt if you really prefer)
          5. the tests are designed to only work if the file test/resources/auth-keys.xml is present. We can mark that as svnignore, gitignore and have the tests load it in. You can then use Test* as the pattern for tests, and be confident that if the -amazon tests run, they are really running. Look at the azure and openstack examples here.

          Having had a quick look at the code

          1. please mark as final fields that are fixed in the constructor
          2. That test FileSystemContractBaseTest.testMkdirsWithUmask()? Just skip it in your subclass; no need to patch the root test.
          stevel@apache.org Steve Loughran added a comment -

          Sumit,

          Should it include tests to verify behavior when a user tries to write paths that have multiple "/" such as s3a://foo////bar/delta///gammma//abc?

          +1

          Should it have more logic to consider _$folder$ as marker as well for folders like it's doing for "/"

          if this helps backwards compatibility

          Should we consider adding ProfileCredentialsProvider

          =0. Why not just have the build read in some non-SCM'd file? It could be a properties file, but for the openstack ones we have an svn/git-ignored auth-keys.xml file in the repo - and skip the tests entirely if it isn't defined. Alternatively, the maven build could be set up to read a properties file in ~/.m2 containing the values, and patch the settings for each test config based on the values. But the build will still need to skip tests if the properties are unset, because the FileSystemContractBaseTest test suite is JUnit 3 and not (cleanly) skippable in the code itself.

          stevel@apache.org Steve Loughran added a comment -

          Should S3AFileStatus be more strict in contructor arguments?

          seems a good idea

          Should it be doing parallel rename/delete operations as well?

          This would reduce the window for incomplete renames being visible, so +1. Do it now before expectations are set, but the thread count would have to be a configurable parameter.

          Should it implement the iterative listing api?

          +1

          ksumit Sumit Kumar added a comment -

          Why not just have the build read in some non-SCM'd file? ...

          Agree with you. My intentions were exactly the same, i.e. to be able to run tests with credentials at a predetermined location. XML might require extra work because the AWS SDK supports the following by default: system properties, environment variables, properties files and the new ProfileCredentialsProvider. I like the idea of skipping tests when credentials are not available; one good idea, however, would be to report in the test report that these tests were skipped because of a missing credentials file (along with the expected location).

          stevel@apache.org Steve Loughran added a comment -
          1. in the openstack builds the entire source tree is skipped when there's no test file: https://github.com/apache/hadoop-common/blob/trunk/hadoop-tools/hadoop-openstack/pom.xml#L40
          2. in the HADOOP-9361 tests of s3n, ftp and swift, individual tests are skipped if the relevant test options are missing:
            hadoop-common-project/hadoop-common/src/test/resources/contract-test-options.xml
            hadoop-tools/hadoop-openstack/src/test/resources/contract-test-options.xml

          This could be improved to add messages, a patch would be welcome there

          mbertozzi Matteo Bertozzi added a comment -

          attached v6, which is the same as v5 but fixes the fs.open().close() case.
          The wrappedObject is initialized only inside the read, so calling close() before a read will throw an NPE. testInputStreamClosedTwice() should reproduce the problem, since it does fs.open().close().

          @@ -175,7 +175,9 @@ public class S3AInputStream extends FSInputStream {
               }
               super.close();
               closed = true;
          -    wrappedObject.close();
          +    if (wrappedObject != null) {
          +      wrappedObject.close();
          +    }
             }
          
          stevel@apache.org Steve Loughran added a comment -

          Nice catch

          1. IOUtils.closeStream() is the standard wrapper for this logic
          2. probably best to set wrappedObject to null too.
          3. it's "best practise" to make seek() and read() throw some IOException if used when closed, NPEs are sadly too common today.

          I do need this patch to go into its own module; HADOOP-10373 lays the groundwork for that - could you use that as a foundation and move this source over? We shouldn't have any more external-FS code in -common, as it creates classpath clutter for everyone and stops you from updating the s3a lib faster than the core hadoop one.
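
          A rough sketch of what those three suggestions might look like (field names are taken from the diff above; this is not the committed code):

              import java.io.Closeable;
              import java.io.IOException;
              import org.apache.hadoop.io.IOUtils;

              // Sketch only, mirroring the shape of S3AInputStream.
              class S3AInputStreamSketch {
                private Closeable wrappedObject;  // the S3 object stream; may never have been opened
                private boolean closed;

                private void checkNotClosed() throws IOException {
                  if (closed) {
                    throw new IOException("Stream is closed");  // clearer than an NPE from read()/seek()
                  }
                }

                public synchronized void close() {
                  closed = true;
                  // IOUtils.closeStream() is null-safe and swallows close() failures,
                  // so a stream that was opened but never read no longer NPEs here.
                  IOUtils.closeStream(wrappedObject);
                  wrappedObject = null;
                }
              }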

          hadoopqa Hadoop QA added a comment -

          +1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12653407/HADOOP-10400-6.patch
          against trunk revision .

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 3 new or modified test files.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 javadoc. There were no new javadoc warning messages.

          +1 eclipse:eclipse. The patch built with eclipse:eclipse.

          +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          +1 core tests. The patch passed unit tests in hadoop-client hadoop-common-project/hadoop-common hadoop-hdfs-project/hadoop-hdfs-httpfs.

          +1 contrib tests. The patch passed contrib unit tests.

          Test results: https://builds.apache.org/job/PreCommit-HADOOP-Build/4194//testReport/
          Console output: https://builds.apache.org/job/PreCommit-HADOOP-Build/4194//console

          This message is automatically generated.

          aloisius Jordan Mendelson added a comment -

          Sorry all, was off for the last month. The upstream version of this code has a few small changes for preliminary retry code to deal with connection closed exceptions during read() (which are sadly being exposed as raw Apache httpcore exceptions derived from IOException). I wanted to use the retry logic that aws itself provides, but it doesn't appear there is a particularly clean way of doing it. As soon as it is done in a semi-sane way, I'll put up a new patch.

          I'm currently integrating all the patches that appear here. Thanks so much for all the contributions! Most should already be in the upstream project https://github.com/Aloisius/hadoop-s3a. The server side encryption one probably needs a better key name, but I can't think of one that better conforms to the current style Hadoop is using.

          It is a bit unwieldy to keep track of all these patches to this patch, reintegrate them into my upstream and then recreate a new patch for hadoop trunk each time. If anyone has any suggestions on making this easier, please let me know. The only reason I keep an upstream version is because I use it in production with CDH.

          aloisius Jordan Mendelson added a comment -

          Also Steve Loughran, should I create my next patch on top of your hadoop-amazon?

          atm Aaron T. Myers added a comment -

          Jordan Mendelson - yes, you should make your next patch on top of the patch at HADOOP-10373.

          Steve Loughran - are you going to have time to commit HADOOP-10373 soon here? If not, I can take care of it. Given how difficult it clearly is for Jordan to continue to maintain this patch series, it'd be great to get this committed and done with ASAP.

          tsato Takenori Sato added a comment -

          Hi Jordan Mendelson,

          I came from HADOOP-10643, where you suggested that a new improvement over NativeS3FileSystem should be done here.

          So I've made 2 pull requests for your upstream repository.

          1. make endpoint configurable
          https://github.com/Aloisius/hadoop-s3a/pull/8

          jets3t allows a user to configure an endpoint (protocol, host, and port) through jets3t.properties, but with the AWS SDK the endpoint can't be configured without calling a particular method in code. This fix simply allows it to be set through configuration (see the configuration sketch at the end of this comment).

          2. subclass of AbstractFileSystem
          https://github.com/Aloisius/hadoop-s3a/pull/9

          This contains a fix for a problem similar to HADOOP-10643. The difference is that this fix is simpler, with no modification to AbstractFileSystem.
          Also, when using this subclass, HADOOP-8984 becomes obvious, so a fix for it is included as well.

          Btw, on my test with Pig, I needed to apply the following fix to make this work.
          "Ensure the file is open before trying to seek"
          https://github.com/Aloisius/hadoop-s3a/pull/6
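
          To illustrate the endpoint-configurability point in the first pull request above, a minimal sketch; the fs.s3a.endpoint property name and the endpoint URL are assumptions for illustration, not a committed configuration key.

          {code:java}
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class S3AEndpointSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Hypothetical key: point the AWS SDK client at an S3-compatible,
    // non-AWS endpoint instead of relying on the SDK default. Internally
    // the filesystem would pass this to the client's setEndpoint() call.
    conf.set("fs.s3a.endpoint", "https://s3-compatible.example.com:9021");

    FileSystem fs = FileSystem.get(URI.create("s3a://example-bucket/"), conf);
    System.out.println(fs.exists(new Path("s3a://example-bucket/some/key")));
  }
}
          {code}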

          dsw David S. Wang added a comment -

          HADOOP-10400-7.patch does the following:

          • Rebases HADOOP-10400-6.patch onto the current tip of trunk, which includes the HADOOP-11074 changes to move the S3 connector bits over to hadoop-aws.
          • Incorporates HADOOP-10675, HADOOP-10676, and HADOOP-10677, which were fixes on top of previous candidate HADOOP-10400 patches.
          • Corrects the jackson 2 dependencies used by the hadoop-aws and hadoop-azure modules.

          With regards to testing, I ran "mvn clean install -Pnative -DskipTests" from the top level, and "mvn test" in both the hadoop-aws and hadoop-azure directories.

          hadoopqa Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12667993/HADOOP-10400-7.patch
          against trunk revision 4be9517.

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 1 new or modified test files.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 javadoc. There were no new javadoc warning messages.

          +1 eclipse:eclipse. The patch built with eclipse:eclipse.

          +1 findbugs. The patch does not introduce any new Findbugs (version 2.0.3) warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          -1 core tests. The patch failed these unit tests in hadoop-client hadoop-common-project/hadoop-common hadoop-hdfs-project/hadoop-hdfs-httpfs hadoop-tools/hadoop-aws hadoop-tools/hadoop-azure:

          org.apache.hadoop.crypto.random.TestOsSecureRandom

          +1 contrib tests. The patch passed contrib unit tests.

          Test results: https://builds.apache.org/job/PreCommit-HADOOP-Build/4697//testReport/
          Console output: https://builds.apache.org/job/PreCommit-HADOOP-Build/4697//console

          This message is automatically generated.

          dsw David S. Wang added a comment -

          The failing unit test is unrelated to this change.

          atm Aaron T. Myers added a comment -

          I'm +1 on the latest patch that Dave posted, as it now lays things out appropriately as Steve requested. There will of course probably be a bit of follow-up work to do on this front, but I think that we should get this patch in now as it constitutes the vast bulk of the implementation, and we can make any other changes required in smaller, more isolated JIRAs.

          Assuming there are no objections, I'll be committing this later today.

          dsw David S. Wang added a comment -

          This is the branch-2 backport for HADOOP-10400. It relies on HADOOP-11074 to be applied first. It applied almost entirely cleanly, except for a reference to hadoop-azure in a POM file, which is not in branch-2.

          stevel@apache.org Steve Loughran added a comment -

          -1: no tests. Sorry, but I don't want that test coverage put off as a "later".

          Tests

          That's FSContractTestBase and the new AbstractFSContract stuff. The tests are there; it's a matter of subclassing them, running the tests and then fixing where it breaks.

          WRT tests that would fail if they were part of the patch:

          1. fix seek() to throw an EOFException on a negative parameter
          2. I don't think rename() works in all corner cases. Just a guess from time spent understanding the depths of rename and writing tests for it.

          replace IOExceptions with meaningful subclasses

          These are things that the abstract contract tests will flag if the FS declares that it supports strict exceptions:

          1. FSDataOutputStream create() should throw FileAlreadyExistsException
          2. S3AFileSystem.initialize() could throw a stricter IOE on bad bucket, FileNotFoundException?
          3. open() should throw FileAlreadyExistsException...
            ...etc. Trivial to do, and the contract tests will show where they are needed (a minimal sketch of the create() case appears at the end of this comment).

          other recommended pre-commit changes

          Things that could be done after commit, but which are easier to do now

          1. line length is too long ... reformat to the Hadoop guidelines
          2. move to SLF4j as the log API
          3. the hadoop-aws POMs should be set up so that instead of excluding the new AWS lib, people only get the hadoop-aws JAR and its dependencies if they ask for it ... it shouldn't be another thing hadoop-client drags in.

          Things not to get into the patch

          • hadoop-common-project/hadoop-common/src/test/resources/core-site.xml shouldn't have those entries

          Longer term,

          1. there's a lot of old->new double config reads. Do we need this? If so, could the deprecated-config feature used elsewhere in Hadoop be used?

          the following things should be lifted from the openstack stuff

          1. implementing forward seeks just by skipping bytes ... this is significantly faster over long-haul links and HTTPS
          2. collection of stats of HTTP verb & operation performance
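
          To illustrate the "meaningful IOException subclasses" point above, a minimal sketch of the create()-side check only; it assumes a getFileStatus() that throws FileNotFoundException for a missing key, and it is not the committed S3AFileSystem code.

          {code:java}
import java.io.FileNotFoundException;
import java.io.IOException;
import org.apache.hadoop.fs.FileAlreadyExistsException;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class StrictCreateCheckSketch {
  /**
   * Fail fast with a specific exception subclass instead of a bare
   * IOException, so contract tests (and callers) can tell what went wrong.
   */
  public static void checkCreate(FileSystem fs, Path path, boolean overwrite)
      throws IOException {
    try {
      FileStatus status = fs.getFileStatus(path);
      if (status.isDirectory()) {
        throw new FileAlreadyExistsException(path + " is a directory");
      }
      if (!overwrite) {
        throw new FileAlreadyExistsException(path + " already exists");
      }
    } catch (FileNotFoundException e) {
      // Nothing at this key: safe to create.
    }
  }
}
          {code}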
          hadoopqa Hadoop QA added a comment -

          +1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12668130/HADOOP-10400-branch-2.patch
          against trunk revision 1e68499.

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 1 new or modified test files.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 javadoc. There were no new javadoc warning messages.

          +1 eclipse:eclipse. The patch built with eclipse:eclipse.

          +1 findbugs. The patch does not introduce any new Findbugs (version 2.0.3) warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          +1 core tests. The patch passed unit tests in hadoop-client hadoop-common-project/hadoop-common hadoop-hdfs-project/hadoop-hdfs-httpfs hadoop-tools/hadoop-aws.

          +1 contrib tests. The patch passed contrib unit tests.

          Test results: https://builds.apache.org/job/PreCommit-HADOOP-Build/4699//testReport/
          Console output: https://builds.apache.org/job/PreCommit-HADOOP-Build/4699//console

          This message is automatically generated.

          dsw David S. Wang added a comment -

          Thanks Steve for your comments.

          I've attached a trunk patch to address your non-longer-term concerns. I believe all of them are addressed.

          I did the same testing as I mentioned in previous comments. In addition, all of the newly-added s3a FS contract tests pass.

          dsw David S. Wang added a comment -

          I should note a few things:

          • I had to change one of the rename tests for s3a in order to conform to S3 behavior - namely that a rename operation of a source directory to a destination directory moves all of the source files under the destination directory, and the source directory is deleted. It does not move the source directory itself to underneath the destination directory. I confirmed this behavior empirically. I did this change by subclassing one of the FS contract rename tests (see the sketch after this list).
          • I added an extra knob into the s3n.xml FS contract config file to state that it supports seeks on closed files, because it does. So does s3a. The exceptions come on a subsequent read as is stated in the FS contract code.
          • I'll file a follow-on JIRA for getting rid of the "old" config names. Steve Loughran, can you file JIRAs for the other longer-term issues you identified, as I am not as clear on them? There are also some other issues raised by other folks in previous comments that should be addressed in follow-on JIRAs, and those folks are probably the right people to file them.
          • Thanks to Sean Busbey for help on some of the finer points of Maven dependencies.
          • Thanks to Juan Yu for help with navigating the FS contract code.
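
          A minimal sketch of what subclassing a contract rename test could look like; the class body, helpers, and assertions below are illustrative and follow the behaviour described in the first bullet, not the committed TestS3AContractRename code.

          {code:java}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.contract.AbstractContractRenameTest;
import org.apache.hadoop.fs.contract.AbstractFSContract;
import org.apache.hadoop.fs.contract.s3a.S3AContract;
import org.junit.Test;

public class TestS3ARenameIntoDirSketch extends AbstractContractRenameTest {

  @Override
  protected AbstractFSContract createContract(Configuration conf) {
    return new S3AContract(conf);
  }

  // Override the generic test so the assertions match observed S3 behaviour:
  // the children of "source" end up directly under "dest", and "source"
  // itself no longer exists after the rename.
  @Test
  @Override
  public void testRenameDirIntoExistingDir() throws Throwable {
    FileSystem fs = getFileSystem();
    Path src = path("source");
    Path dest = path("dest");
    fs.mkdirs(src);
    fs.create(new Path(src, "file.txt")).close();
    fs.mkdirs(dest);

    fs.rename(src, dest);

    assertTrue("expected source child under dest",
        fs.exists(new Path(dest, "file.txt")));
    assertFalse("expected source dir to be gone", fs.exists(src));
  }
}
          {code}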
          hadoopqa Hadoop QA added a comment -

          +1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12668509/HADOOP-10400-8.patch
          against trunk revision a0ad975.

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 11 new or modified test files.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 javadoc. There were no new javadoc warning messages.

          +1 eclipse:eclipse. The patch built with eclipse:eclipse.

          +1 findbugs. The patch does not introduce any new Findbugs (version 2.0.3) warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          +1 core tests. The patch passed unit tests in hadoop-common-project/hadoop-common hadoop-hdfs-project/hadoop-hdfs-httpfs hadoop-tools/hadoop-aws hadoop-tools/hadoop-azure.

          +1 contrib tests. The patch passed contrib unit tests.

          Test results: https://builds.apache.org/job/PreCommit-HADOOP-Build/4709//testReport/
          Console output: https://builds.apache.org/job/PreCommit-HADOOP-Build/4709//console

          This message is automatically generated.

          stevel@apache.org Steve Loughran added a comment -

          +1

          Looks good, the tests are in there, and we can evolve it in situ now.

          BTW, why did you override {{testRenameDirIntoExistingDir()}}?

          dsw David S. Wang added a comment -

          Steve Loughran, thanks for the +1.

          I overrode the test because of the explanation in my previous comment, repeated here for your convenience:

          I had to change one of the rename tests for s3a in order to conform to s3 behavior - namely that a rename operation of a source directory to a destination directory moves all of the source files to under the destination directory, and the source directory is deleted. It does not move the source directory to underneath the destination directory. I confirmed this behavior empirically. I did this change by subclassing one of the FS contract rename tests.

          Separately, also note that I filed HADOOP-11091 to eliminate old configuration parameter names from s3a.

          dsw David S. Wang added a comment -

          Here's a branch-2 backport for the latest patch HADOOP-10400-8.patch.

          hadoopqa Hadoop QA added a comment -

          +1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12668578/HADOOP-10400-8-branch-2.patch
          against trunk revision 98588cf.

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 11 new or modified test files.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 javadoc. There were no new javadoc warning messages.

          +1 eclipse:eclipse. The patch built with eclipse:eclipse.

          +1 findbugs. The patch does not introduce any new Findbugs (version 2.0.3) warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          +1 core tests. The patch passed unit tests in hadoop-common-project/hadoop-common hadoop-hdfs-project/hadoop-hdfs-httpfs hadoop-tools/hadoop-aws.

          +1 contrib tests. The patch passed contrib unit tests.

          Test results: https://builds.apache.org/job/PreCommit-HADOOP-Build/4712//testReport/
          Console output: https://builds.apache.org/job/PreCommit-HADOOP-Build/4712//console

          This message is automatically generated.

          atm Aaron T. Myers added a comment -

          I've just committed this to trunk and branch-2.

          Thanks a lot, Jordan, for the initial implementation and thanks to Dave for taking the patch over the finish line. Thanks also to Steve and others for all of the reviews they've provided.

          hudson Hudson added a comment -

          SUCCESS: Integrated in Hadoop-Yarn-trunk #682 (See https://builds.apache.org/job/Hadoop-Yarn-trunk/682/)
          HADOOP-10400. Incorporate new S3A FileSystem implementation. Contributed by Jordan Mendelson and Dave Wang. (atm: rev 24d920b80eb3626073925a1d0b6dcf148add8cc0)

          • hadoop-project/pom.xml
          • hadoop-common-project/hadoop-common/CHANGES.txt
          • hadoop-hdfs-project/hadoop-hdfs-httpfs/pom.xml
          • hadoop-tools/hadoop-aws/src/main/resources/META-INF/services/org.apache.hadoop.fs.FileSystem
          • hadoop-common-project/hadoop-common/src/main/conf/log4j.properties
          • hadoop-tools/hadoop-aws/src/test/java/org/apache/hadoop/fs/contract/s3a/S3AContract.java
          • hadoop-tools/hadoop-aws/src/test/java/org/apache/hadoop/fs/contract/s3a/TestS3AContractRootDir.java
          • hadoop-tools/hadoop-aws/src/test/java/org/apache/hadoop/fs/contract/s3a/TestS3AContractMkdir.java
          • hadoop-tools/hadoop-aws/src/test/java/org/apache/hadoop/fs/contract/s3a/TestS3AContractOpen.java
          • hadoop-tools/hadoop-aws/src/main/java/org/apache/hadoop/fs/s3a/S3AFileStatus.java
          • hadoop-common-project/hadoop-common/src/main/resources/core-default.xml
          • hadoop-tools/hadoop-aws/src/test/java/org/apache/hadoop/fs/s3a/S3AFileSystemContractBaseTest.java
          • hadoop-tools/hadoop-aws/pom.xml
          • hadoop-tools/hadoop-azure/pom.xml
          • hadoop-tools/hadoop-aws/src/main/java/org/apache/hadoop/fs/s3a/S3AFileSystem.java
          • hadoop-tools/hadoop-aws/src/main/java/org/apache/hadoop/fs/s3a/AnonymousAWSCredentialsProvider.java
          • hadoop-tools/hadoop-aws/src/test/java/org/apache/hadoop/fs/contract/s3a/TestS3AContractSeek.java
          • hadoop-tools/hadoop-aws/src/test/java/org/apache/hadoop/fs/contract/s3a/TestS3AContractDelete.java
          • hadoop-tools/hadoop-aws/src/test/java/org/apache/hadoop/fs/contract/s3a/TestS3AContractCreate.java
          • hadoop-tools/hadoop-aws/src/main/java/org/apache/hadoop/fs/s3a/BasicAWSCredentialsProvider.java
          • hadoop-tools/hadoop-aws/src/test/java/org/apache/hadoop/fs/contract/s3a/TestS3AContractRename.java
          • hadoop-tools/hadoop-aws/src/main/java/org/apache/hadoop/fs/s3a/Constants.java
          • hadoop-tools/hadoop-aws/src/main/java/org/apache/hadoop/fs/s3a/S3AInputStream.java
          • hadoop-tools/hadoop-aws/src/test/resources/contract/s3a.xml
          • hadoop-tools/hadoop-aws/src/test/resources/contract/s3n.xml
          • hadoop-tools/hadoop-aws/src/main/java/org/apache/hadoop/fs/s3a/S3AOutputStream.java
          hudson Hudson added a comment -

          SUCCESS: Integrated in Hadoop-Hdfs-trunk #1873 (See https://builds.apache.org/job/Hadoop-Hdfs-trunk/1873/)
          HADOOP-10400. Incorporate new S3A FileSystem implementation. Contributed by Jordan Mendelson and Dave Wang. (atm: rev 24d920b80eb3626073925a1d0b6dcf148add8cc0)

          • hadoop-tools/hadoop-azure/pom.xml
          • hadoop-tools/hadoop-aws/src/test/resources/contract/s3a.xml
          • hadoop-tools/hadoop-aws/src/main/java/org/apache/hadoop/fs/s3a/Constants.java
          • hadoop-hdfs-project/hadoop-hdfs-httpfs/pom.xml
          • hadoop-common-project/hadoop-common/src/main/resources/core-default.xml
          • hadoop-tools/hadoop-aws/src/test/java/org/apache/hadoop/fs/contract/s3a/TestS3AContractRename.java
          • hadoop-tools/hadoop-aws/src/test/java/org/apache/hadoop/fs/contract/s3a/TestS3AContractSeek.java
          • hadoop-common-project/hadoop-common/CHANGES.txt
          • hadoop-tools/hadoop-aws/src/test/java/org/apache/hadoop/fs/contract/s3a/TestS3AContractDelete.java
          • hadoop-tools/hadoop-aws/src/test/java/org/apache/hadoop/fs/contract/s3a/S3AContract.java
          • hadoop-tools/hadoop-aws/src/test/java/org/apache/hadoop/fs/contract/s3a/TestS3AContractMkdir.java
          • hadoop-common-project/hadoop-common/src/main/conf/log4j.properties
          • hadoop-tools/hadoop-aws/src/main/java/org/apache/hadoop/fs/s3a/S3AInputStream.java
          • hadoop-tools/hadoop-aws/src/main/java/org/apache/hadoop/fs/s3a/AnonymousAWSCredentialsProvider.java
          • hadoop-tools/hadoop-aws/src/test/java/org/apache/hadoop/fs/contract/s3a/TestS3AContractOpen.java
          • hadoop-tools/hadoop-aws/src/main/java/org/apache/hadoop/fs/s3a/S3AFileStatus.java
          • hadoop-tools/hadoop-aws/pom.xml
          • hadoop-tools/hadoop-aws/src/test/resources/contract/s3n.xml
          • hadoop-tools/hadoop-aws/src/main/java/org/apache/hadoop/fs/s3a/S3AFileSystem.java
          • hadoop-tools/hadoop-aws/src/main/java/org/apache/hadoop/fs/s3a/BasicAWSCredentialsProvider.java
          • hadoop-tools/hadoop-aws/src/main/resources/META-INF/services/org.apache.hadoop.fs.FileSystem
          • hadoop-tools/hadoop-aws/src/main/java/org/apache/hadoop/fs/s3a/S3AOutputStream.java
          • hadoop-tools/hadoop-aws/src/test/java/org/apache/hadoop/fs/contract/s3a/TestS3AContractCreate.java
          • hadoop-tools/hadoop-aws/src/test/java/org/apache/hadoop/fs/contract/s3a/TestS3AContractRootDir.java
          • hadoop-project/pom.xml
          • hadoop-tools/hadoop-aws/src/test/java/org/apache/hadoop/fs/s3a/S3AFileSystemContractBaseTest.java
          hudson Hudson added a comment -

          FAILURE: Integrated in Hadoop-Mapreduce-trunk #1898 (See https://builds.apache.org/job/Hadoop-Mapreduce-trunk/1898/)
          HADOOP-10400. Incorporate new S3A FileSystem implementation. Contributed by Jordan Mendelson and Dave Wang. (atm: rev 24d920b80eb3626073925a1d0b6dcf148add8cc0)

          • hadoop-tools/hadoop-aws/pom.xml
          • hadoop-tools/hadoop-aws/src/test/java/org/apache/hadoop/fs/contract/s3a/TestS3AContractSeek.java
          • hadoop-tools/hadoop-aws/src/main/java/org/apache/hadoop/fs/s3a/AnonymousAWSCredentialsProvider.java
          • hadoop-tools/hadoop-aws/src/main/java/org/apache/hadoop/fs/s3a/S3AOutputStream.java
          • hadoop-common-project/hadoop-common/src/main/resources/core-default.xml
          • hadoop-tools/hadoop-aws/src/test/java/org/apache/hadoop/fs/contract/s3a/TestS3AContractOpen.java
          • hadoop-tools/hadoop-aws/src/main/java/org/apache/hadoop/fs/s3a/S3AFileStatus.java
          • hadoop-tools/hadoop-aws/src/test/java/org/apache/hadoop/fs/contract/s3a/TestS3AContractRename.java
          • hadoop-tools/hadoop-azure/pom.xml
          • hadoop-tools/hadoop-aws/src/main/java/org/apache/hadoop/fs/s3a/S3AInputStream.java
          • hadoop-tools/hadoop-aws/src/main/java/org/apache/hadoop/fs/s3a/S3AFileSystem.java
          • hadoop-tools/hadoop-aws/src/test/resources/contract/s3n.xml
          • hadoop-tools/hadoop-aws/src/main/resources/META-INF/services/org.apache.hadoop.fs.FileSystem
          • hadoop-tools/hadoop-aws/src/test/java/org/apache/hadoop/fs/contract/s3a/TestS3AContractRootDir.java
          • hadoop-project/pom.xml
          • hadoop-hdfs-project/hadoop-hdfs-httpfs/pom.xml
          • hadoop-tools/hadoop-aws/src/test/java/org/apache/hadoop/fs/contract/s3a/TestS3AContractMkdir.java
          • hadoop-tools/hadoop-aws/src/test/resources/contract/s3a.xml
          • hadoop-tools/hadoop-aws/src/test/java/org/apache/hadoop/fs/s3a/S3AFileSystemContractBaseTest.java
          • hadoop-tools/hadoop-aws/src/main/java/org/apache/hadoop/fs/s3a/BasicAWSCredentialsProvider.java
          • hadoop-tools/hadoop-aws/src/test/java/org/apache/hadoop/fs/contract/s3a/TestS3AContractCreate.java
          • hadoop-tools/hadoop-aws/src/test/java/org/apache/hadoop/fs/contract/s3a/TestS3AContractDelete.java
          • hadoop-tools/hadoop-aws/src/main/java/org/apache/hadoop/fs/s3a/Constants.java
          • hadoop-tools/hadoop-aws/src/test/java/org/apache/hadoop/fs/contract/s3a/S3AContract.java
          • hadoop-common-project/hadoop-common/CHANGES.txt
          • hadoop-common-project/hadoop-common/src/main/conf/log4j.properties
          thodemoor Thomas Demoor added a comment -

          Patching as many issues as possible before the "launch" of s3a in 2.6 seems like a good idea. Therefore, I would like to remind you of the comment above suggesting that people create JIRAs for the issues they brought up in earlier comments. I am also posting some fixes/ideas in separate JIRAs.

          stevel@apache.org Steve Loughran added a comment -

          HADOOP-10714 is a bug that needs fixing; I've promised I'll look at the final patch this weekend just to verify that the latest changes work. We should think about splitting things up into:

          1. stuff that if we get wrong now is expensive/painful/impossible to fix (property names...)
          2. bugs that need to be fixed before it is usable
          3. low-risk changes: documentation, more tests.
          4. features that could be added later without backwards-compatibility problems

          then focus on the first three, though of course anyone is free to work on #4 too. I can't promise any time to review features; I'll try to look at the critical bugs when I can (this is a spare-time activity for me).

          thodemoor Thomas Demoor added a comment -

          One of the things I think we should address pre-2.6 is the Pig error reported above. Should we simply throw the error on seeking in a closed file and change the contract XML, or does that break other parts of the ecosystem?

          jyu@cloudera.com Juan Yu added a comment -

          I ran the seek test against s3a and s3n many times and didn't have a problem. The contract tests have a specific test case for seeking on a closed file, tAbstractContractSeekTest#estSeekReadClosedFile, and it passed for both s3a and s3n because they "support seek on closed file"; be aware that you are still going to get an IOException on the read, though.
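
          A small sketch of the behaviour being described, assuming a placeholder bucket and key; this is illustrative, not one of the contract tests themselves.

          {code:java}
import java.io.IOException;
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class SeekOnClosedFileSketch {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(URI.create("s3a://example-bucket/"), new Configuration());
    FSDataInputStream in = fs.open(new Path("s3a://example-bucket/some/object"));
    in.close();
    in.seek(10);          // tolerated by s3a/s3n: no exception on the seek itself
    try {
      in.read();          // the failure surfaces on the next read instead
    } catch (IOException expected) {
      System.out.println("read() after close failed as expected: " + expected);
    }
  }
}
          {code}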

          jyu@cloudera.com Juan Yu added a comment -

          Correction: the test is AbstractContractSeekTest#testSeekReadClosedFile.

          thodemoor Thomas Demoor added a comment -

          I see that the first attempt at constructing a 2.6 release candidate has started. That would thus be the "official launch" of s3a.

          I have some small patches (HADOOP-11262, HADOOP-11261, HADOOP-11171) awaiting review (Steve Loughran has provided feedback on the latter two issues and his fixes have been incorporated). I feel that launching a new filesystem without YARN support could hamper its adoption, so it would be nice if these make it into 2.6. If any of the committers have time to take a look, I am available to quickly address any issues should they arise.

          stevel@apache.org Steve Loughran added a comment -

          Thomas, only just caught up on this ... too busy with release things. I'll have a look with the goal being 2.8 (Jan/Feb).

          superwai Jerry Lam added a comment -

          Hi guys, I've been using s3a for quite some time. Today I found an issue with finishedWrite. Apparently the only thing it does is delete /{bucket}/{key}/. Why is this necessary? I created some hundred thousand files on S3 using Spark. All the files are created within 5 minutes, but the job cannot complete because finishedWrite takes over an hour to run. Is it safe not to delete unnecessary files? Thanks!

          superwai Jerry Lam added a comment -

          I screwed up the comment above:

          what it is apparently trying to delete is /(BUCKET_NAME)/(KEY)/

          stevel@apache.org Steve Loughran added a comment -

          Jerry, closed issues aren't the place to raise things ... ask on hadoop-user and then escalate to a new JIRA, after scanning the existing ones.

          Is the issue with a trailing space? Or is it that it's doing it file by file? Because if it's the latter, that's all we can do on S3: delete files one by one. We've discussed (HADOOP-9555) having some async operations, because the fastest way to rm stuff on S3 is to set the time-to-delete flag to a few seconds and then leave it. What comes first, though, is making it clear to callers that this filesystem is an object store, and some operations (rename, delete) are O(files), O(files*data), or worse.

          superwai Jerry Lam added a comment -

          Hi Steve, thank you for the response. Sorry I posted it here because this is the only ticket that is relevant to the issue. I will post the question in hadoop-user as you suggested.

          For your reference, the problem is in https://github.com/apache/hadoop/blob/trunk/hadoop-tools/hadoop-aws/src/main/java/org/apache/hadoop/fs/s3a/S3AFileSystem.java#L1142

          The deleteUnnecessaryFakeDirectories method takes a long time to execute. Most of time is spent in https://github.com/apache/hadoop/blob/trunk/hadoop-tools/hadoop-aws/src/main/java/org/apache/hadoop/fs/s3a/S3AFileSystem.java#L1154

          I have a Spark job that creates over 100,000 files in a single job. All the files are created nicely and very quickly, BUT the finishedWrite method never returns. The stack trace is as follows:

          java.net.SocketInputStream.socketRead0(Native Method)
          java.net.SocketInputStream.read(SocketInputStream.java:152)
          java.net.SocketInputStream.read(SocketInputStream.java:122)
          org.apache.http.impl.io.AbstractSessionInputBuffer.fillBuffer(AbstractSessionInputBuffer.java:160)
          org.apache.http.impl.io.SocketInputBuffer.fillBuffer(SocketInputBuffer.java:84)
          org.apache.http.impl.io.AbstractSessionInputBuffer.readLine(AbstractSessionInputBuffer.java:273)
          org.apache.http.impl.conn.DefaultHttpResponseParser.parseHead(DefaultHttpResponseParser.java:140)
          org.apache.http.impl.conn.DefaultHttpResponseParser.parseHead(DefaultHttpResponseParser.java:57)
          org.apache.http.impl.io.AbstractMessageParser.parse(AbstractMessageParser.java:260)
          org.apache.http.impl.AbstractHttpClientConnection.receiveResponseHeader(AbstractHttpClientConnection.java:283)
          org.apache.http.impl.conn.DefaultClientConnection.receiveResponseHeader(DefaultClientConnection.java:251)
          org.apache.http.impl.conn.ManagedClientConnectionImpl.receiveResponseHeader(ManagedClientConnectionImpl.java:197)
          org.apache.http.protocol.HttpRequestExecutor.doReceiveResponse(HttpRequestExecutor.java:271)
          com.amazonaws.http.protocol.SdkHttpRequestExecutor.doReceiveResponse(SdkHttpRequestExecutor.java:66)
          org.apache.http.protocol.HttpRequestExecutor.execute(HttpRequestExecutor.java:123)
          org.apache.http.impl.client.DefaultRequestDirector.tryExecute(DefaultRequestDirector.java:682)
          org.apache.http.impl.client.DefaultRequestDirector.execute(DefaultRequestDirector.java:486)
          org.apache.http.impl.client.AbstractHttpClient.doExecute(AbstractHttpClient.java:863)
          org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:82)
          org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:57)
          com.amazonaws.http.AmazonHttpClient.executeHelper(AmazonHttpClient.java:384)
          com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:232)
          com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:3528)
          com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:3480)
          com.amazonaws.services.s3.AmazonS3Client.listObjects(AmazonS3Client.java:604)
          org.apache.hadoop.fs.s3a.S3AFileSystem.getFileStatus(S3AFileSystem.java:962)
          org.apache.hadoop.fs.s3a.S3AFileSystem.deleteUnnecessaryFakeDirectories(S3AFileSystem.java:1147)
          org.apache.hadoop.fs.s3a.S3AFileSystem.finishedWrite(S3AFileSystem.java:1136)
          org.apache.hadoop.fs.s3a.S3AOutputStream.close(S3AOutputStream.java:142)
          org.apache.hadoop.fs.FSDataOutputStream$PositionCache.close(FSDataOutputStream.java:72)
          org.apache.hadoop.fs.FSDataOutputStream.close(FSDataOutputStream.java:106)
          org.apache.parquet.hadoop.ParquetFileWriter.end(ParquetFileWriter.java:400)
          org.apache.parquet.hadoop.InternalParquetRecordWriter.close(InternalParquetRecordWriter.java:117)
          org.apache.parquet.hadoop.ParquetRecordWriter.close(ParquetRecordWriter.java:112)
          org.apache.spark.sql.execution.datasources.parquet.ParquetOutputWriter.close(ParquetRelation.scala:101)
          org.apache.spark.sql.execution.datasources.DynamicPartitionWriterContainer.writeRows(WriterContainer.scala:369)
          org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1$$anonfun$apply$mcV$sp$3.apply(InsertIntoHadoopFsRelation.scala:150)
          org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1$$anonfun$apply$mcV$sp$3.apply(InsertIntoHadoopFsRelation.scala:150)
          org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
          org.apache.spark.scheduler.Task.run(Task.scala:88)
          org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
          java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
          java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
          java.lang.Thread.run(Thread.java:745)
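
          A sketch of the pattern behind the cost being reported; the method body below is illustrative (the committed deleteUnnecessaryFakeDirectories logic differs in detail), but it shows why every file close can trigger O(path depth) S3 round trips, which adds up fast over 100,000+ files.

          {code:java}
import java.io.FileNotFoundException;
import java.io.IOException;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class FakeDirectoryCleanupSketch {
  /** Walk up the ancestors of a newly written file, removing empty "dir/" markers. */
  static void deleteFakeParentDirs(FileSystem fs, Path file) throws IOException {
    Path parent = file.getParent();
    while (parent != null && !parent.isRoot()) {
      try {
        FileStatus status = fs.getFileStatus(parent);          // S3 round trip(s)
        if (status.isDirectory() && fs.listStatus(parent).length == 0) {
          fs.delete(parent, false);                            // drop the empty marker object
        }
      } catch (FileNotFoundException ignored) {
        // no marker object for this ancestor; keep walking up
      }
      parent = parent.getParent();
    }
  }
}
          {code}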

          stevel@apache.org Steve Loughran added a comment -

          File a new JIRA under HADOOP-11694

          Note that Databricks provides a direct output committer for writing data back to object stores; it is one without any rename.
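
          As a sketch of what "direct output committer" means here: tasks write straight to their final destinations, so the commit and abort phases do nothing and no rename is ever issued. This is illustrative only (not the Databricks or HADOOP-9565 code) and is unsafe when speculative execution or task retries can leave partial output behind.

          {code:java}
import java.io.IOException;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.OutputCommitter;
import org.apache.hadoop.mapreduce.TaskAttemptContext;

public class DirectOutputCommitterSketch extends OutputCommitter {
  @Override public void setupJob(JobContext context) throws IOException { }
  @Override public void setupTask(TaskAttemptContext context) throws IOException { }
  @Override public boolean needsTaskCommit(TaskAttemptContext context) throws IOException {
    return false;   // nothing staged, so there is no task commit phase
  }
  @Override public void commitTask(TaskAttemptContext context) throws IOException { }
  @Override public void abortTask(TaskAttemptContext context) throws IOException { }
}
          {code}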

          Thomas Demoor Thomas Demoor added a comment -

          Alternatively, the last patch uploaded in HADOOP-9565 also does direct output committing.

          sebastianherold Sebastian Herold added a comment -

          What about the ProfileCredentialsProvider? Do you plan to add it to the credentials provider chain in S3AFileSystem?
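
          For reference, a sketch of what adding it could look like with the AWS SDK for Java v1; the exact provider classes and their ordering in S3AFileSystem are an open question here, so treat this as an assumption rather than the shipped chain.

          {code:java}
import com.amazonaws.auth.AWSCredentialsProviderChain;
import com.amazonaws.auth.BasicAWSCredentials;
import com.amazonaws.auth.InstanceProfileCredentialsProvider;
import com.amazonaws.auth.profile.ProfileCredentialsProvider;
import com.amazonaws.internal.StaticCredentialsProvider;

public class CredentialChainSketch {
  public static AWSCredentialsProviderChain buildChain(String accessKey, String secretKey) {
    return new AWSCredentialsProviderChain(
        new StaticCredentialsProvider(new BasicAWSCredentials(accessKey, secretKey)),
        new ProfileCredentialsProvider(),           // reads ~/.aws/credentials profiles
        new InstanceProfileCredentialsProvider());  // IAM role credentials on EC2
  }
}
          {code}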


            People

            • Assignee: aloisius Jordan Mendelson
            • Reporter: aloisius Jordan Mendelson
            • Votes: 1
            • Watchers: 35
