[HADOOP-17241] s3a: bucket names which aren't parseable hostnames unsupported

Details

    • Type: Bug
    • Status: Resolved
    • Priority: Minor
    • Resolution: Duplicate
    • Affects Version/s: 2.7.4, 3.2.0
    • Fix Version/s: None
    • Component/s: fs/s3
    • Labels: None

    Description

      Hi there,
      I'm using Spark to read some data from S3 and I encountered an error when reading from a bucket whose name contains a period (e.g. `s3a://okokes-test-v1.1/foo.csv`). I have close to zero Java experience, but I've tried to trace this as well as I can. Apologies for any misunderstanding on my part.

      Edit: the title is a little misleading - bucket names can contain dots and s3a will work, but only if the name conforms to hostname restrictions. E.g. `s3a://foo.bar/bak.csv` works, but my bucket, `okokes-test-v1.1`, does not, because its final label, `1`, does not conform to a top-level domain pattern.
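
      To illustrate (a standalone Java snippet, not Hadoop code): java.net.URI only fills in getHost() when the authority parses as a valid hostname; otherwise getHost() is null and the raw name survives only in getAuthority().

      import java.net.URI;

      public class BucketUriDemo {
          public static void main(String[] args) {
              // "foo.bar" is a syntactically valid hostname, so getHost() works:
              URI ok = URI.create("s3a://foo.bar/bak.csv");
              System.out.println(ok.getHost());        // foo.bar

              // "okokes-test-v1.1" is not: its final label "1" is all digits,
              // which the hostname grammar (RFC 2396) forbids for a top-level
              // domain, so the host is null and only the authority remains:
              URI bad = URI.create("s3a://okokes-test-v1.1/foo.csv");
              System.out.println(bad.getHost());       // null
              System.out.println(bad.getAuthority());  // okokes-test-v1.1
          }
      }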

      Using hadoop-aws:3.2.0, I get the following:

      java.lang.NullPointerException: null uri host.
       at java.base/java.util.Objects.requireNonNull(Objects.java:246)
       at org.apache.hadoop.fs.s3native.S3xLoginHelper.buildFSURI(S3xLoginHelper.java:71)
       at org.apache.hadoop.fs.s3a.S3AFileSystem.setUri(S3AFileSystem.java:470)
       at org.apache.hadoop.fs.s3a.S3AFileSystem.initialize(S3AFileSystem.java:235)
       at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:3303)
       at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:124)
       at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:3352)
       at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:3320)
       at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:479)
       at org.apache.hadoop.fs.Path.getFileSystem(Path.java:361)
       at org.apache.spark.sql.execution.streaming.FileStreamSink$.hasMetadata(FileStreamSink.scala:46)
       at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:361)
       at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:279)
       at org.apache.spark.sql.DataFrameReader.$anonfun$load$2(DataFrameReader.scala:268)
       at scala.Option.getOrElse(Option.scala:189)
       at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:268)
       at org.apache.spark.sql.DataFrameReader.csv(DataFrameReader.scala:705)
       at org.apache.spark.sql.DataFrameReader.csv(DataFrameReader.scala:535)
       ... 47 elided

      hadoop-aws:2.7.4 led to a similar outcome:

      java.lang.IllegalArgumentException: The bucketName parameter must be specified.
       at com.amazonaws.services.s3.AmazonS3Client.assertParameterNotNull(AmazonS3Client.java:2816)
       at com.amazonaws.services.s3.AmazonS3Client.headBucket(AmazonS3Client.java:1026)
       at com.amazonaws.services.s3.AmazonS3Client.doesBucketExist(AmazonS3Client.java:994)
       at org.apache.hadoop.fs.s3a.S3AFileSystem.initialize(S3AFileSystem.java:297)
       at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2669)
       at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:94)
       at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2703)
       at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2685)
       at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:373)
       at org.apache.hadoop.fs.Path.getFileSystem(Path.java:295)
       at org.apache.spark.sql.execution.streaming.FileStreamSink$.hasMetadata(FileStreamSink.scala:46)
       at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:361)
       at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:279)
       at org.apache.spark.sql.DataFrameReader.$anonfun$load$2(DataFrameReader.scala:268)
       at scala.Option.getOrElse(Option.scala:189)
       at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:268)
       at org.apache.spark.sql.DataFrameReader.csv(DataFrameReader.scala:705)
       at org.apache.spark.sql.DataFrameReader.csv(DataFrameReader.scala:535)
       ... 47 elided

      I investigated the issue a little and found that buildFSURI requires the host to be non-null - see S3xLoginHelper.java - but in my case the host is null, and the authority component of the URI should be used instead. When I checked how the AWS SDK handles this case, it seems to use the authority for all s3:// paths - https://github.com/aws/aws-sdk-java/blob/master/aws-java-sdk-s3/src/main/java/com/amazonaws/services/s3/AmazonS3URI.java#L85.
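
      For illustration, a minimal sketch of the fallback I have in mind (a hypothetical helper, not the actual S3xLoginHelper code):

      import java.net.URI;
      import java.util.Objects;

      public final class BucketNameFromUri {
          // Hypothetical: prefer the parsed host, and fall back to the raw
          // authority when the bucket name isn't a valid hostname.
          static String bucketName(URI uri) {
              String host = uri.getHost();
              if (host != null) {
                  return host;
              }
              return Objects.requireNonNull(uri.getAuthority(),
                  "null uri host and authority: " + uri);
          }

          public static void main(String[] args) {
              System.out.println(bucketName(URI.create("s3a://foo.bar/bak.csv")));
              // foo.bar
              System.out.println(bucketName(URI.create("s3a://okokes-test-v1.1/foo.csv")));
              // okokes-test-v1.1
          }
      }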

      I verified this URI in a Scala shell (openjdk 1.8.0_252):

      scala> (new URI("s3a://okokes-test-v1.1/foo.csv")).getHost()
      val res1: String = null

      scala> (new URI("s3a://okokes-test-v1.1/foo.csv")).getAuthority()
      val res2: String = okokes-test-v1.1

      Oh, and this is indeed a valid bucket name. Not only did I create it in the console, the naming rules are also documented here: https://docs.aws.amazon.com/AmazonS3/latest/dev/BucketRestrictions.html#bucketnamingrules
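
      As a sanity check (assuming the aws-java-sdk-s3 artifact mentioned above is on the classpath), the SDK's own AmazonS3URI parses this bucket name without complaint, precisely because it reads the authority rather than the host. Note it special-cases the s3:// scheme, not s3a://:

      import com.amazonaws.services.s3.AmazonS3URI;

      public class SdkParseCheck {
          public static void main(String[] args) {
              // AmazonS3URI takes uri.getAuthority() as the bucket for s3://
              // URIs, so the dotted name is accepted:
              AmazonS3URI u = new AmazonS3URI("s3://okokes-test-v1.1/foo.csv");
              System.out.println(u.getBucket()); // okokes-test-v1.1
              System.out.println(u.getKey());    // foo.csv
          }
      }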

    People

    • Assignee: Unassigned
    • Reporter: Ondrej Kokes
    • Votes: 0
    • Watchers: 3
