Details
- Type: Bug
- Status: Resolved
- Priority: Minor
- Resolution: Incomplete
- Affects Version/s: 2.0.0, 2.0.1, 2.1.0, 2.2.0
- Fix Version/s: None
Description
If you try to open a file whose name matches a pattern like "*{*}*.*" or "*[*]*.*" with the CSV reader, you get an "org.apache.spark.sql.AnalysisException: Path does not exist" error, whether the file is local or on HDFS.
This bug can be reproduced on master and all other Spark 2 branches.
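The root cause appears to be that Spark resolves input paths as Hadoop glob patterns, in which "{", "}", "[" and "]" are metacharacters, so a literal name such as "test{00-1}.txt" is interpreted as an alternation pattern rather than a plain file name. A minimal sketch of that behaviour, calling Hadoop's FileSystem.globStatus directly (the path below is illustrative only):

// Sketch: how Hadoop glob matching, which Spark uses when resolving input
// paths, treats a literal file name containing braces. Path is illustrative.
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

val fs = FileSystem.getLocal(new Configuration())
// "{00-1}" is parsed as a glob alternation with the single branch "00-1",
// so this pattern would match a file literally named "test00-1.txt",
// not the existing file "test{00-1}.txt".
val matches = fs.globStatus(new Path("/Users/reza/test/test{00-1}.txt"))
println(Option(matches).map(_.length).getOrElse(0))  // prints 0 for the file described above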
To reproduce:
- Create a file like "test
{00-1}.txt" on a local directory (like in /Users/reza/test/test{00-1}
.txt)
- Run spark-shell
- Execute this command:
val df=spark.read.option("header","false").csv("/Users/reza/test/*.txt")
You will see the following stack trace:
org.apache.spark.sql.AnalysisException: Path does not exist: file:/Users/reza/test/test\{00-01\}.txt;
  at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$14.apply(DataSource.scala:367)
  at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$14.apply(DataSource.scala:360)
  at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
  at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
  at scala.collection.immutable.List.foreach(List.scala:381)
  at scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:241)
  at scala.collection.immutable.List.flatMap(List.scala:344)
  at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:360)
  at org.apache.spark.sql.execution.datasources.csv.CSVFileFormat.readText(CSVFileFormat.scala:208)
  at org.apache.spark.sql.execution.datasources.csv.CSVFileFormat.inferSchema(CSVFileFormat.scala:63)
  at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$7.apply(DataSource.scala:174)
  at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$7.apply(DataSource.scala:174)
  at scala.Option.orElse(Option.scala:289)
  at org.apache.spark.sql.execution.datasources.DataSource.getOrInferFileFormatSchema(DataSource.scala:173)
  at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:377)
  at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:158)
  at org.apache.spark.sql.DataFrameReader.csv(DataFrameReader.scala:423)
  at org.apache.spark.sql.DataFrameReader.csv(DataFrameReader.scala:360)
  ... 48 elided
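For convenience, here is a self-contained sketch of the same reproduction that can be pasted into spark-shell (the directory /tmp/spark-brace-test is just an illustrative choice):

// Create a CSV file whose name contains braces, then read the directory with a glob.
import java.nio.charset.StandardCharsets
import java.nio.file.{Files, Paths}

val dir = Files.createDirectories(Paths.get("/tmp/spark-brace-test"))
Files.write(dir.resolve("test{00-1}.txt"), "1,2,3\n4,5,6\n".getBytes(StandardCharsets.UTF_8))

// Fails with AnalysisException: Path does not exist, even though the file exists.
val df = spark.read.option("header", "false").csv("/tmp/spark-brace-test/*.txt")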
If you put the file on HDFS instead (for example under /user/root) and run the following:
val df=spark.read.option("header", false).csv("/user/root/*.txt")
You will get the following exception:
org.apache.hadoop.mapred.InvalidInputException: Input Pattern hdfs://hosturl/user/root/test\{00-01\}.txt matches 0 files
  at org.apache.hadoop.mapred.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:287)
  at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:229)
  at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:315)
  at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:199)
  at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:248)
  at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:246)
  at scala.Option.getOrElse(Option.scala:121)
  at org.apache.spark.rdd.RDD.partitions(RDD.scala:246)
  at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
  at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:248)
  at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:246)
  at scala.Option.getOrElse(Option.scala:121)
  at org.apache.spark.rdd.RDD.partitions(RDD.scala:246)
  at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
  at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:248)
  at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:246)
  at scala.Option.getOrElse(Option.scala:121)
  at org.apache.spark.rdd.RDD.partitions(RDD.scala:246)
  at org.apache.spark.rdd.RDD$$anonfun$take$1.apply(RDD.scala:1297)
  at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
  at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
  at org.apache.spark.rdd.RDD.withScope(RDD.scala:358)
  at org.apache.spark.rdd.RDD.take(RDD.scala:1292)
  at org.apache.spark.rdd.RDD$$anonfun$first$1.apply(RDD.scala:1332)
  at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
  at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
  at org.apache.spark.rdd.RDD.withScope(RDD.scala:358)
  at org.apache.spark.rdd.RDD.first(RDD.scala:1331)
  at org.apache.spark.sql.execution.datasources.csv.CSVFileFormat.findFirstLine(CSVFileFormat.scala:167)
  at org.apache.spark.sql.execution.datasources.csv.CSVFileFormat.inferSchema(CSVFileFormat.scala:59)
  at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$15.apply(DataSource.scala:421)
  at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$15.apply(DataSource.scala:421)
  at scala.Option.orElse(Option.scala:289)
  at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:420)
  at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:149)
  at org.apache.spark.sql.DataFrameReader.csv(DataFrameReader.scala:413)
  at org.apache.spark.sql.DataFrameReader.csv(DataFrameReader.scala:349)
  ... 48 elided
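A possible workaround, sketched below under the assumption that renaming the file is acceptable, is to copy it to a name without glob metacharacters before reading it; for HDFS the equivalent would be a rename via "hdfs dfs -mv" or FileSystem.rename. Paths are illustrative only.

// Workaround sketch (not an official fix): copy the local file to a
// glob-safe name, then read that copy instead.
import java.nio.file.{Files, Paths, StandardCopyOption}

val src = Paths.get("/Users/reza/test/test{00-1}.txt")
val dst = Paths.get("/Users/reza/test/test_00-1.txt")
Files.copy(src, dst, StandardCopyOption.REPLACE_EXISTING)

val df = spark.read.option("header", "false").csv(dst.toString)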