Spark / SPARK-20061

Reading a file with colon (:) from S3 fails with URISyntaxException

Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Duplicate
    • Affects Version/s: 2.1.0
    • Fix Version/s: None
    • Component/s: Structured Streaming
    • Labels: None
    • Environment: EC2, AWS

    Description

      Reading a set of files from S3 through a wildcard path fails with the following exception:

      scala> val fn = "s3a://mybucket/path/*/"
      scala> val ds = spark.readStream.schema(schema).json(fn)
      
      java.lang.IllegalArgumentException: java.net.URISyntaxException: Relative path in absolute URI: 2017-01-06T20:33:45.255-analyticsqa-49569270507599054034141623773442922465540524816321216514.json
        at org.apache.hadoop.fs.Path.initialize(Path.java:205)
        at org.apache.hadoop.fs.Path.<init>(Path.java:171)
        at org.apache.hadoop.fs.Path.<init>(Path.java:93)
        at org.apache.hadoop.fs.Globber.glob(Globber.java:241)
        at org.apache.hadoop.fs.FileSystem.globStatus(FileSystem.java:1657)
        at org.apache.spark.deploy.SparkHadoopUtil.globPath(SparkHadoopUtil.scala:237)
        at org.apache.spark.deploy.SparkHadoopUtil.globPathIfNecessary(SparkHadoopUtil.scala:243)
        at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$2.apply(DataSource.scala:131)
        at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$2.apply(DataSource.scala:127)
        at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
        at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
        at scala.collection.immutable.List.foreach(List.scala:381)
        at scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:241)
        at scala.collection.immutable.List.flatMap(List.scala:344)
        at org.apache.spark.sql.execution.datasources.DataSource.tempFileIndex$lzycompute$1(DataSource.scala:127)
        at org.apache.spark.sql.execution.datasources.DataSource.org$apache$spark$sql$execution$datasources$DataSource$$tempFileIndex$1(DataSource.scala:124)
        at org.apache.spark.sql.execution.datasources.DataSource.org$apache$spark$sql$execution$datasources$DataSource$$getOrInferFileFormatSchema(DataSource.scala:138)
        at org.apache.spark.sql.execution.datasources.DataSource.sourceSchema(DataSource.scala:229)
        at org.apache.spark.sql.execution.datasources.DataSource.sourceInfo$lzycompute(DataSource.scala:87)
        at org.apache.spark.sql.execution.datasources.DataSource.sourceInfo(DataSource.scala:87)
        at org.apache.spark.sql.execution.streaming.StreamingRelation$.apply(StreamingRelation.scala:30)
        at org.apache.spark.sql.streaming.DataStreamReader.load(DataStreamReader.scala:124)
        at org.apache.spark.sql.streaming.DataStreamReader.load(DataStreamReader.scala:133)
        at org.apache.spark.sql.streaming.DataStreamReader.json(DataStreamReader.scala:181)
        ... 50 elided
      Caused by: java.net.URISyntaxException: Relative path in absolute URI: 2017-01-06T20:33:45.255-analyticsqa-49569270507599054034141623773442922465540524816321216514.json
        at java.net.URI.checkPath(URI.java:1823)
        at java.net.URI.<init>(URI.java:745)
        at org.apache.hadoop.fs.Path.initialize(Path.java:202)
        ... 73 more
      

      The file in question, whose name contains colons, sits at the root of s3a://mybucket/path/:

      aws s3 ls s3://mybucket/path/
      
                                 PRE subfolder1/
                                 PRE subfolder2/
      ...
      2017-01-06 20:33:46       1383 2017-01-06T20:33:45.255-analyticsqa-49569270507599054034141623773442922465540524816321216514.json
      ...
      

      Removing the wildcard from the path makes the read work, but it then misses all files in the subdirectories.
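
The "Relative path in absolute URI" message can be reproduced without S3 or Spark. The sketch below is only an illustration: the class and method names are hypothetical, and the scheme split is an assumption about how Hadoop's Path(String) treats text before the first colon when no slash precedes it. It is not the actual Hadoop code. The point is that java.net.URI rejects a non-empty path that does not start with '/' once a scheme is present, which is what happens when the timestamp prefix of the file name is taken for a scheme.

```java
import java.net.URI;
import java.net.URISyntaxException;

public class ColonPathDemo {

    // Returns the URISyntaxException reason, or null if the name is accepted.
    static String failureReason(String name) {
        // Assumption: text before the first ':' is taken as the URI scheme
        // when no '/' comes before that colon.
        String scheme = null;
        String path = name;
        int colon = name.indexOf(':');
        int slash = name.indexOf('/');
        if (colon != -1 && (slash == -1 || colon < slash)) {
            scheme = name.substring(0, colon);
            path = name.substring(colon + 1);
        }
        try {
            // With a scheme set, a non-empty path must start with '/'.
            new URI(scheme, null, path, null, null);
            return null;
        } catch (URISyntaxException e) {
            return e.getReason();
        }
    }

    public static void main(String[] args) {
        // File name taken from the report above.
        String reported = "2017-01-06T20:33:45.255-analyticsqa-"
                + "49569270507599054034141623773442922465540524816321216514.json";
        System.out.println(failureReason(reported));      // Relative path in absolute URI
        System.out.println(failureReason("events.json")); // null (no colon, hypothetical name)
    }
}
```

In this sketch the reported name is split into the scheme "2017-01-06T20" and a path starting with "33:45.255-", which matches the exception text in the stack trace. A name without a colon is accepted.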

            People

              Assignee: Unassigned
              Reporter: Michel Lemay (FlamingMike)
              Votes: 0
              Watchers: 3
