SPARK-19340

Opening a file in CSV format will result in an exception if the filename contains special characters

Details

    • Type: Bug
    • Status: Resolved
    • Priority: Minor
    • Resolution: Incomplete
    • Affects Version/s: 2.0.0, 2.0.1, 2.1.0, 2.2.0
    • Fix Version/s: None
    • Component/s: SQL

    Description

      If you try to open a file whose name contains curly braces or square brackets, that is, a name of the form

       "*{*}*.*" 

      or

       "*[*]*.*" 

      using the CSV format, you get an exception such as "org.apache.spark.sql.AnalysisException: Path does not exist", whether the file is on the local file system or on HDFS.
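
      A likely reason such names are fragile is that "{...}" and "[...]" are pattern syntax in glob matching, so an unescaped name is not read as a literal. The sketch below illustrates this with the JDK's own glob matcher, used here only as a stand-in: it is not the Hadoop/Spark code path, and GlobDemo and globMatches are hypothetical names introduced for illustration.

```scala
import java.nio.file.{FileSystems, Paths}

object GlobDemo {
  // True if `fileName` matches the JDK glob `pattern`.
  // Assumption: java.nio's glob matcher stands in for glob matching in general;
  // Spark resolves paths through Hadoop's glob support, not through this class.
  def globMatches(pattern: String, fileName: String): Boolean =
    FileSystems.getDefault
      .getPathMatcher("glob:" + pattern)
      .matches(Paths.get(fileName))

  def main(args: Array[String]): Unit = {
    val name = "test{00-1}.txt"
    // Unescaped braces are parsed as a group, so the literal name does not match.
    println(globMatches("test{00-1}.txt", name))
    // Escaping the opening and closing brace makes the pattern match the literal name.
    println(globMatches("test\\{00-1\\}.txt", name))
    // A plain wildcard still matches the file.
    println(globMatches("*.txt", name))
  }
}
```

      In spark-shell, pasting the object and calling GlobDemo.main(Array()) runs the three checks. The escaped form corresponds to the "\{...\}" that appears in the exception messages in this report.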

      This bug can be reproduced on master and all other Spark 2 branches.
      To reproduce:

      1. Create a file like "test {00-1}.txt" in a local directory (for example, /Users/reza/test/test{00-1}.txt)

      2. Run spark-shell
      3. Execute this command:
        val df = spark.read.option("header", "false").csv("/Users/reza/test/*.txt")
        

      You will see the following stack trace:

      org.apache.spark.sql.AnalysisException: Path does not exist: file:/Users/reza/test/test\{00-01\}.txt;
        at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$14.apply(DataSource.scala:367)
        at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$14.apply(DataSource.scala:360)
        at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
        at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
        at scala.collection.immutable.List.foreach(List.scala:381)
        at scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:241)
        at scala.collection.immutable.List.flatMap(List.scala:344)
        at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:360)
        at org.apache.spark.sql.execution.datasources.csv.CSVFileFormat.readText(CSVFileFormat.scala:208)
        at org.apache.spark.sql.execution.datasources.csv.CSVFileFormat.inferSchema(CSVFileFormat.scala:63)
        at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$7.apply(DataSource.scala:174)
        at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$7.apply(DataSource.scala:174)
        at scala.Option.orElse(Option.scala:289)
        at org.apache.spark.sql.execution.datasources.DataSource.getOrInferFileFormatSchema(DataSource.scala:173)
        at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:377)
        at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:158)
        at org.apache.spark.sql.DataFrameReader.csv(DataFrameReader.scala:423)
        at org.apache.spark.sql.DataFrameReader.csv(DataFrameReader.scala:360)
        ... 48 elided
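
      The local reproduction steps above can be scripted. This is a minimal sketch under stated assumptions: it writes to a temporary directory instead of /Users/reza/test, uses the file name from the path in step 1, and invents the file contents. ReproSetup and createBraceFile are hypothetical names.

```scala
import java.nio.charset.StandardCharsets
import java.nio.file.{Files, Path}

object ReproSetup {
  // Creates a temporary directory containing one small comma-separated file
  // whose name contains curly braces, and returns that directory.
  // Assumptions: the temp location and the file contents are illustrative only.
  def createBraceFile(): Path = {
    val dir = Files.createTempDirectory("spark-19340-")
    val file = dir.resolve("test{00-1}.txt")
    Files.write(file, "a,b\n1,2\n".getBytes(StandardCharsets.UTF_8))
    dir
  }
}

// In spark-shell, where `spark` is predefined, the read from step 3 becomes:
//   val dir = ReproSetup.createBraceFile()
//   val df = spark.read.option("header", "false").csv(dir.toString + "/*.txt")
```

      On an affected version the read in the trailing comment is expected to fail as described in this report; the sketch itself only prepares the file.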
      

      If you put the file on HDFS (for example, under /user/root) and run the following:

      val df = spark.read.option("header", false).csv("/user/root/*.txt")
      

      You will get the following exception:

      org.apache.hadoop.mapred.InvalidInputException: Input Pattern hdfs://hosturl/user/root/test\{00-01\}.txt matches 0 files
        at org.apache.hadoop.mapred.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:287)
        at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:229)
        at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:315)
        at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:199)
        at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:248)
        at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:246)
        at scala.Option.getOrElse(Option.scala:121)
        at org.apache.spark.rdd.RDD.partitions(RDD.scala:246)
        at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
        at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:248)
        at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:246)
        at scala.Option.getOrElse(Option.scala:121)
        at org.apache.spark.rdd.RDD.partitions(RDD.scala:246)
        at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
        at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:248)
        at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:246)
        at scala.Option.getOrElse(Option.scala:121)
        at org.apache.spark.rdd.RDD.partitions(RDD.scala:246)
        at org.apache.spark.rdd.RDD$$anonfun$take$1.apply(RDD.scala:1297)
        at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
        at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
        at org.apache.spark.rdd.RDD.withScope(RDD.scala:358)
        at org.apache.spark.rdd.RDD.take(RDD.scala:1292)
        at org.apache.spark.rdd.RDD$$anonfun$first$1.apply(RDD.scala:1332)
        at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
        at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
        at org.apache.spark.rdd.RDD.withScope(RDD.scala:358)
        at org.apache.spark.rdd.RDD.first(RDD.scala:1331)
        at org.apache.spark.sql.execution.datasources.csv.CSVFileFormat.findFirstLine(CSVFileFormat.scala:167)
        at org.apache.spark.sql.execution.datasources.csv.CSVFileFormat.inferSchema(CSVFileFormat.scala:59)
        at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$15.apply(DataSource.scala:421)
        at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$15.apply(DataSource.scala:421)
        at scala.Option.orElse(Option.scala:289)
        at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:420)
        at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:149)
        at org.apache.spark.sql.DataFrameReader.csv(DataFrameReader.scala:413)
        at org.apache.spark.sql.DataFrameReader.csv(DataFrameReader.scala:349)
        ... 48 elided
      

          People

            Assignee: Unassigned
            Reporter: Reza Safi (rezasafi)
            Votes: 0
            Watchers: 6
