Uploaded image for project: 'Spark'
  1. Spark
  2. SPARK-23814

Couldn't read file with colon in name and new line character in one of the field.

    XMLWordPrintableJSON

Details

    • Bug
    • Status: Resolved
    • Major
    • Resolution: Duplicate
    • 2.2.0
    • None
    • Spark Core, Spark Shell
    • None

    Description

      When the file name has colon and new line character in data, while reading using spark.read.option("multiLine","true").csv("s3n://DirectoryPath/") function. It is throwing "java.lang.IllegalArgumentException: java.net.URISyntaxException: Relative path in absolute URI: 2017-08-01T00:00:00Z.csv.gz" error. If we remove the option("multiLine","true"), it is working just fine though the file name has colon in it. It is working fine, If i apply this option option("multiLine","true") on any other file which doesn't have colon in it. But when both are present (colon in file name and new line in the data), it's not working.

      java.lang.IllegalArgumentException: java.net.URISyntaxException: Relative path in absolute URI: 2017-08-01T00:00:00Z.csv.gz

        at org.apache.hadoop.fs.Path.initialize(Path.java:205)

        at org.apache.hadoop.fs.Path.<init>(Path.java:171)

        at org.apache.hadoop.fs.Path.<init>(Path.java:93)

        at org.apache.hadoop.fs.Globber.glob(Globber.java:253)

        at org.apache.hadoop.fs.FileSystem.globStatus(FileSystem.java:1676)

        at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:294)

        at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.listStatus(FileInputFormat.java:265)

        at org.apache.spark.input.StreamFileInputFormat.setMinPartitions(PortableDataStream.scala:51)

        at org.apache.spark.rdd.BinaryFileRDD.getPartitions(BinaryFileRDD.scala:46)

        at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:252)

        at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:250)

        at scala.Option.getOrElse(Option.scala:121)

        at org.apache.spark.rdd.RDD.partitions(RDD.scala:250)

        at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)

        at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:252)

        at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:250)

        at scala.Option.getOrElse(Option.scala:121)

        at org.apache.spark.rdd.RDD.partitions(RDD.scala:250)

        at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)

        at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:252)

        at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:250)

        at scala.Option.getOrElse(Option.scala:121)

        at org.apache.spark.rdd.RDD.partitions(RDD.scala:250)

        at org.apache.spark.rdd.RDD$$anonfun$take$1.apply(RDD.scala:1333)

        at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)

        at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)

        at org.apache.spark.rdd.RDD.withScope(RDD.scala:362)

        at org.apache.spark.rdd.RDD.take(RDD.scala:1327)

        at org.apache.spark.sql.execution.datasources.csv.MultiLineCSVDataSource$.infer(CSVDataSource.scala:224)

        at org.apache.spark.sql.execution.datasources.csv.CSVDataSource.inferSchema(CSVDataSource.scala:62)

        at org.apache.spark.sql.execution.datasources.csv.CSVFileFormat.inferSchema(CSVFileFormat.scala:57)

        at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$7.apply(DataSource.scala:177)

        at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$7.apply(DataSource.scala:177)

        at scala.Option.orElse(Option.scala:289)

        at org.apache.spark.sql.execution.datasources.DataSource.getOrInferFileFormatSchema(DataSource.scala:176)

        at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:366)

        at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:178)

        at org.apache.spark.sql.DataFrameReader.csv(DataFrameReader.scala:533)

        at org.apache.spark.sql.DataFrameReader.csv(DataFrameReader.scala:412)

        ... 48 elided

      Caused by: java.net.URISyntaxException: Relative path in absolute URI: 2017-08-01T00:00:00Z.csv.gz

        at java.net.URI.checkPath(URI.java:1823)

        at java.net.URI.<init>(URI.java:745)

        at org.apache.hadoop.fs.Path.initialize(Path.java:202)

        ... 86 more

      Attachments

        Issue Links

          Activity

            People

              Unassigned Unassigned
              abharath9 bharath kumar avusherla
              Votes:
              0 Vote for this issue
              Watchers:
              5 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved: