Hadoop Common
HADOOP-5805

Problem using top-level S3 buckets as input/output directories

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 0.18.3
    • Fix Version/s: 0.21.0
    • Component/s: fs/s3
    • Labels: None
    • Environment: EC2, Cloudera AMI, 20 nodes
    • Hadoop Flags: Reviewed

      Description

      When I specify top-level S3 buckets as input or output directories, I get the following exception.

      hadoop jar subject-map-reduce.jar s3n://infocloud-input s3n://infocloud-output

      java.lang.IllegalArgumentException: Path must be absolute: s3n://infocloud-output
      at org.apache.hadoop.fs.s3native.NativeS3FileSystem.pathToKey(NativeS3FileSystem.java:246)
      at org.apache.hadoop.fs.s3native.NativeS3FileSystem.getFileStatus(NativeS3FileSystem.java:319)
      at org.apache.hadoop.fs.FileSystem.exists(FileSystem.java:667)
      at org.apache.hadoop.mapred.FileOutputFormat.checkOutputSpecs(FileOutputFormat.java:109)
      at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:738)
      at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1026)
      at com.evri.infocloud.prototype.subjectmapreduce.SubjectMRDriver.run(SubjectMRDriver.java:63)
      at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
      at com.evri.infocloud.prototype.subjectmapreduce.SubjectMRDriver.main(SubjectMRDriver.java:25)
      at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
      at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
      at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
      at java.lang.reflect.Method.invoke(Method.java:597)
      at org.apache.hadoop.util.RunJar.main(RunJar.java:155)
      at org.apache.hadoop.mapred.JobShell.run(JobShell.java:54)
      at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
      at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79)
      at org.apache.hadoop.mapred.JobShell.main(JobShell.java:68)

      The workaround is to specify input/output buckets with sub-directories:

      hadoop jar subject-map-reduce.jar s3n://infocloud-input/input-subdir s3n://infocloud-output/output-subdir

      Attachments

      1. HADOOP-5805-2.patch (1 kB, Tom White)
      2. HADOOP-5805-1.patch (1 kB, Ian Nowland)
      3. HADOOP-5805-0.patch (1 kB, Ian Nowland)

        Activity

        Ian Nowland added a comment -

        There are two problems here.

        The first is that S3N currently requires a terminating slash on the URI to indicate the root of a bucket. That is, it accepts s3n://infocloud-input/ but not s3n://infocloud-input. This is fixed by the attached patch, which allows either form to be used.

        This fixes the input bucket case but not the output one.

        The second problem is that S3N requires a bucket to already exist before it can be used. But if you attempt to use the bucket's "root" as the output directory, you get the standard Hadoop behavior of FileOutputFormat throwing a FileAlreadyExistsException, even if the bucket is empty, because the root directory "/" of the bucket does exist. To me the ideal fix for this second problem is to change FileOutputFormat so it does not throw if the output directory exists but is empty. However, that seems a fairly large change to the established behavior, so I did not include it in this more trivial patch.

        As an aside, since each AWS account gets only 100 buckets, you generally don't want to be writing the output of each job to a new bucket anyway.
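
        A minimal, self-contained sketch of the idea (illustrative only, not the code from the attached patch; the class and helper names below are made up): a bucket URI with or without the trailing slash maps to the same key, the root of the bucket.

        import java.net.URI;

        // Illustrative sketch only: shows the URI handling described above, where a
        // bucket URI with or without a trailing slash both resolve to the bucket root.
        public class S3nRootKeySketch {

          // Returns the S3 key for an s3n:// URI; "" is used to mean the bucket root.
          static String pathToKey(URI uri) {
            String path = uri.getPath();   // "" for s3n://bucket, "/" for s3n://bucket/
            if (path == null || path.isEmpty() || path.equals("/")) {
              return "";                   // both forms name the root of the bucket
            }
            return path.substring(1);      // strip the leading slash for a normal key
          }

          public static void main(String[] args) {
            System.out.println("\"" + pathToKey(URI.create("s3n://infocloud-input")) + "\"");  // ""
            System.out.println("\"" + pathToKey(URI.create("s3n://infocloud-input/")) + "\""); // ""
            System.out.println(pathToKey(URI.create("s3n://infocloud-input/sub/key")));        // sub/key
          }
        }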

        Tom White added a comment -

        This looks like a good fix. The test should assert that it gets back an appropriate FileStatus object.

        The patch needs to be regenerated since the tests have moved from src/test to src/test/core.

        For the second problem, you could subclass your output format and override checkOutputSpecs() so it doesn't throw FileAlreadyExistsException. But I agree it would be nicer to deal with this generally. Perhaps open a separate Jira, as it would affect more than NativeS3FileSystem.
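
        A hedged sketch of that workaround, assuming the old org.apache.hadoop.mapred API seen in the stack trace above (the class name is illustrative, and this is not code from any attached patch): subclass an output format and override checkOutputSpecs() so that an output directory which exists but is empty is tolerated.

        import java.io.IOException;

        import org.apache.hadoop.fs.FileSystem;
        import org.apache.hadoop.fs.Path;
        import org.apache.hadoop.mapred.FileOutputFormat;
        import org.apache.hadoop.mapred.JobConf;
        import org.apache.hadoop.mapred.TextOutputFormat;

        // Illustrative subclass: skip the "output directory already exists" check when
        // the directory exists but is empty (e.g. the root of a fresh S3 bucket).
        public class EmptyDirTolerantOutputFormat<K, V> extends TextOutputFormat<K, V> {

          @Override
          public void checkOutputSpecs(FileSystem ignored, JobConf job) throws IOException {
            Path outDir = FileOutputFormat.getOutputPath(job);
            if (outDir != null) {
              FileSystem fs = outDir.getFileSystem(job);
              if (fs.exists(outDir) && fs.listStatus(outDir).length == 0) {
                return;  // empty output directory: let the job proceed
              }
            }
            super.checkOutputSpecs(ignored, job);  // otherwise keep the standard check
          }
        }

        With the old API, a job driver would then select it with conf.setOutputFormat(EmptyDirTolerantOutputFormat.class).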

        Ian Nowland added a comment -

        New patch against trunk. Moved test and added assert.

        Also created https://issues.apache.org/jira/browse/HADOOP-5889

        Hadoop QA added a comment -

        -1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12408752/HADOOP-5805-1.patch
        against trunk revision 777330.

        +1 @author. The patch does not contain any @author tags.

        +1 tests included. The patch appears to include 4 new or modified tests.

        -1 patch. The patch command could not apply the patch.

        Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-vesta.apache.org/375/console

        This message is automatically generated.

        Tom White added a comment -

        For some reason the patch didn't apply. Here's a regenerated version.

        Hadoop QA added a comment -

        -1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12409019/HADOOP-5805-2.patch
        against trunk revision 779338.

        +1 @author. The patch does not contain any @author tags.

        +1 tests included. The patch appears to include 4 new or modified tests.

        +1 javadoc. The javadoc tool did not generate any warning messages.

        +1 javac. The applied patch does not increase the total number of javac compiler warnings.

        +1 findbugs. The patch does not introduce any new Findbugs warnings.

        +1 Eclipse classpath. The patch retains Eclipse classpath integrity.

        +1 release audit. The applied patch does not increase the total number of release audit warnings.

        +1 core tests. The patch passed core unit tests.

        -1 contrib tests. The patch failed contrib unit tests.

        Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-vesta.apache.org/415/testReport/
        Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-vesta.apache.org/415/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
        Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-vesta.apache.org/415/artifact/trunk/build/test/checkstyle-errors.html
        Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-vesta.apache.org/415/console

        This message is automatically generated.

        Tom White added a comment -

        I've just committed this. Thanks Ian!

        (The contrib test failure was unrelated.)

        Hudson added a comment -

        Integrated in Hadoop-trunk #863 (See http://hudson.zones.apache.org/hudson/job/Hadoop-trunk/863/)

          People

          • Assignee: Ian Nowland
          • Reporter: Arun Jacob
          • Votes: 0
          • Watchers: 4
