SPARK-32766: s3a: bucket names with dots cannot be used


    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Minor
    • Resolution: Invalid
    • Affects Version/s: 3.0.0
    • Fix Version/s: None
    • Component/s: Input/Output
    • Labels: None

      Description

      Running vanilla Spark with

      --packages=org.apache.hadoop:hadoop-aws:x.y.z

      I cannot read from S3 if the bucket name contains a dot (dots are valid in bucket names).

      A minimal reproducible example looks like this:

      from pyspark.sql import SparkSession

      if __name__ == '__main__':
          spark = (SparkSession
              .builder
              .appName('my_app')
              .master("local[*]")
              .getOrCreate()
          )

          spark.read.csv("s3a://test-bucket-name-v1.0/foo.csv")

      Or just launch a spark-shell with `--packages=(...)hadoop-aws(...)` and read that CSV. I created the same bucket without the dot in its name and it worked fine.
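
      For completeness, the launch commands look roughly like this (the version number and the script name repro.py are placeholders; pick the hadoop-aws version that matches your Hadoop build):

      spark-submit --packages org.apache.hadoop:hadoop-aws:2.7.4 repro.py
      spark-shell --packages org.apache.hadoop:hadoop-aws:2.7.4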

      Now I'm not sure whether this is a matter of preparing the path names before passing them to the aws-sdk, or whether the fault lies within the SDK itself. I am not Java-savvy enough to investigate further, but I tried to make the repro as short as possible.
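
      One thing I could check from Python (just a sketch of my suspicion, not a diagnosis): Python's own URL parser extracts the dotted bucket from the s3a URI without complaint, which makes me suspect the host extraction on the Java side rather than the path handling in Spark:

      from urllib.parse import urlparse

      # Python treats the dotted bucket name as an ordinary URI host
      uri = urlparse("s3a://test-bucket-name-v1.0/foo.csv")
      print(uri.hostname)  # test-bucket-name-v1.0
      print(uri.path)      # /foo.csv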


      I get different errors depending on which Hadoop distribution I use. With the default PySpark distribution (which bundles Hadoop 2), I get the following (using hadoop-aws:2.7.4):

      scala> spark.read.csv("s3a://okokes-test-v2.5/foo.csv").show()
      java.lang.IllegalArgumentException: The bucketName parameter must be specified.
       at com.amazonaws.services.s3.AmazonS3Client.assertParameterNotNull(AmazonS3Client.java:2816)
       at com.amazonaws.services.s3.AmazonS3Client.headBucket(AmazonS3Client.java:1026)
       at com.amazonaws.services.s3.AmazonS3Client.doesBucketExist(AmazonS3Client.java:994)
       at org.apache.hadoop.fs.s3a.S3AFileSystem.initialize(S3AFileSystem.java:297)
       at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2669)
       at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:94)
       at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2703)
       at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2685)
       at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:373)
       at org.apache.hadoop.fs.Path.getFileSystem(Path.java:295)
       at org.apache.spark.sql.execution.streaming.FileStreamSink$.hasMetadata(FileStreamSink.scala:46)
       at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:361)
       at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:279)
       at org.apache.spark.sql.DataFrameReader.$anonfun$load$2(DataFrameReader.scala:268)
       at scala.Option.getOrElse(Option.scala:189)
       at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:268)
       at org.apache.spark.sql.DataFrameReader.csv(DataFrameReader.scala:705)
       at org.apache.spark.sql.DataFrameReader.csv(DataFrameReader.scala:535)
       ... 47 elided

      When I downloaded Spark 3.0.0 with Hadoop 3 and ran a spark-shell from that distribution, I got this error (with hadoop-aws:3.2.0):

      java.lang.NullPointerException: null uri host.
       at java.base/java.util.Objects.requireNonNull(Objects.java:246)
       at org.apache.hadoop.fs.s3native.S3xLoginHelper.buildFSURI(S3xLoginHelper.java:71)
       at org.apache.hadoop.fs.s3a.S3AFileSystem.setUri(S3AFileSystem.java:470)
       at org.apache.hadoop.fs.s3a.S3AFileSystem.initialize(S3AFileSystem.java:235)
       at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:3303)
       at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:124)
       at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:3352)
       at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:3320)
       at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:479)
       at org.apache.hadoop.fs.Path.getFileSystem(Path.java:361)
       at org.apache.spark.sql.execution.streaming.FileStreamSink$.hasMetadata(FileStreamSink.scala:46)
       at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:361)
       at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:279)
       at org.apache.spark.sql.DataFrameReader.$anonfun$load$2(DataFrameReader.scala:268)
       at scala.Option.getOrElse(Option.scala:189)
       at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:268)
       at org.apache.spark.sql.DataFrameReader.csv(DataFrameReader.scala:705)
       at org.apache.spark.sql.DataFrameReader.csv(DataFrameReader.scala:535)
       ... 47 elided
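
      The "null uri host" NPE makes me wonder (purely a guess on my part, from the S3xLoginHelper frame above) whether java.net.URI is the culprit: it follows RFC 2396, where the last label of a hostname must start with a letter, and these bucket names end in the labels "0" and "5", so getHost() would return null. A toy check of that rule:

      import re

      # toplabel per RFC 2396: starts with a letter, ends with a letter or digit
      TOPLABEL = re.compile(r'^[A-Za-z]([A-Za-z0-9-]*[A-Za-z0-9])?$')

      for bucket in ("test-bucket-name-v1.0", "okokes-test-v2.5", "test-bucket-name-v1-0"):
          last_label = bucket.rsplit(".", 1)[-1]
          verdict = "valid host" if TOPLABEL.match(last_label) else "host would come back null"
          print(f"{bucket}: last label {last_label!r} -> {verdict}")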

      People

      • Assignee: Unassigned
      • Reporter: Ondrej Kokes
      • Votes: 0
      • Watchers: 4
