Details
- Type: Bug
- Status: Resolved
- Priority: Major
- Resolution: Duplicate
- Affects Version/s: 2.1.0
- Fix Version/s: None
- Component/s: None
- Environment: EC2, AWS
Description
When reading a bunch of files from S3 using wildcards, it fails with the following exception:

scala> val fn = "s3a://mybucket/path/*/"
scala> val ds = spark.readStream.schema(schema).json(fn)
java.lang.IllegalArgumentException: java.net.URISyntaxException: Relative path in absolute URI: 2017-01-06T20:33:45.255-analyticsqa-49569270507599054034141623773442922465540524816321216514.json
  at org.apache.hadoop.fs.Path.initialize(Path.java:205)
  at org.apache.hadoop.fs.Path.<init>(Path.java:171)
  at org.apache.hadoop.fs.Path.<init>(Path.java:93)
  at org.apache.hadoop.fs.Globber.glob(Globber.java:241)
  at org.apache.hadoop.fs.FileSystem.globStatus(FileSystem.java:1657)
  at org.apache.spark.deploy.SparkHadoopUtil.globPath(SparkHadoopUtil.scala:237)
  at org.apache.spark.deploy.SparkHadoopUtil.globPathIfNecessary(SparkHadoopUtil.scala:243)
  at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$2.apply(DataSource.scala:131)
  at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$2.apply(DataSource.scala:127)
  at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
  at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
  at scala.collection.immutable.List.foreach(List.scala:381)
  at scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:241)
  at scala.collection.immutable.List.flatMap(List.scala:344)
  at org.apache.spark.sql.execution.datasources.DataSource.tempFileIndex$lzycompute$1(DataSource.scala:127)
  at org.apache.spark.sql.execution.datasources.DataSource.org$apache$spark$sql$execution$datasources$DataSource$$tempFileIndex$1(DataSource.scala:124)
  at org.apache.spark.sql.execution.datasources.DataSource.org$apache$spark$sql$execution$datasources$DataSource$$getOrInferFileFormatSchema(DataSource.scala:138)
  at org.apache.spark.sql.execution.datasources.DataSource.sourceSchema(DataSource.scala:229)
  at org.apache.spark.sql.execution.datasources.DataSource.sourceInfo$lzycompute(DataSource.scala:87)
  at org.apache.spark.sql.execution.datasources.DataSource.sourceInfo(DataSource.scala:87)
  at org.apache.spark.sql.execution.streaming.StreamingRelation$.apply(StreamingRelation.scala:30)
  at org.apache.spark.sql.streaming.DataStreamReader.load(DataStreamReader.scala:124)
  at org.apache.spark.sql.streaming.DataStreamReader.load(DataStreamReader.scala:133)
  at org.apache.spark.sql.streaming.DataStreamReader.json(DataStreamReader.scala:181)
  ... 50 elided
Caused by: java.net.URISyntaxException: Relative path in absolute URI: 2017-01-06T20:33:45.255-analyticsqa-49569270507599054034141623773442922465540524816321216514.json
  at java.net.URI.checkPath(URI.java:1823)
  at java.net.URI.<init>(URI.java:745)
  at org.apache.hadoop.fs.Path.initialize(Path.java:202)
  ... 73 more
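As the linked HADOOP-14235 describes, the root cause is that during globbing Hadoop's Path constructor treats everything before the first ':' that precedes the first '/' as a URI scheme, so a bare file name containing colons is split into a bogus scheme plus a relative remainder, which java.net.URI rejects. A JDK-only sketch of that failure mode (the short file name here is an illustrative stand-in for the real one):

```scala
import java.net.URI

// "2017-01-06T20:33:45.255-report.json" would be split by Hadoop's Path into
// scheme "2017-01-06T20" and relative remainder "33:45.255-report.json".
// java.net.URI refuses a non-null scheme combined with a relative path,
// which is the URISyntaxException seen in the stack trace above.
val message =
  try {
    new URI("2017-01-06T20", null, "33:45.255-report.json", null, null)
    "no error"
  } catch {
    case e: java.net.URISyntaxException => e.getReason
  }

println(message) // prints "Relative path in absolute URI"
```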
The file in question sits at the root of s3a://mybucket/path/
aws s3 ls s3://mybucket/path/
PRE subfolder1/
PRE subfolder2/
...
2017-01-06 20:33:46 1383 2017-01-06T20:33:45.255-analyticsqa-49569270507599054034141623773442922465540524816321216514.json
...
Removing the wildcard from the path makes it work, but then it obviously misses all files in the subdirectories.
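Until the underlying Hadoop issue is fixed, one possible workaround is to enumerate the candidate paths yourself and drop any whose final component contains a colon before handing them to Spark. This is only a sketch in plain Scala; the helper name and file names are illustrative, not part of Spark's API:

```scala
// A path is safe to pass to Hadoop's globber only if its file name
// (the part after the last '/') contains no ':'.
def safeForGlob(path: String): Boolean = {
  val name = path.substring(path.lastIndexOf('/') + 1)
  !name.contains(':')
}

// Hypothetical listing, e.g. obtained from the AWS SDK or `aws s3 ls`.
val listing = Seq(
  "s3a://mybucket/path/subfolder1/a.json",
  "s3a://mybucket/path/2017-01-06T20:33:45.255-report.json"
)

val usable = listing.filter(safeForGlob)
// usable now holds only "s3a://mybucket/path/subfolder1/a.json"
```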
Issue Links
- duplicates
  - HADOOP-14235 S3A Path does not understand colon (:) when globbing (Resolved)
- relates to
  - HADOOP-14217 Object Storage: support colon in object path (Open)