  1. Spark
  2. SPARK-32208

Spark SQL throws an "Illegal character" exception when loading certain unusual HDFS paths

Details

    • Type: Improvement
    • Status: In Progress
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 2.4.3, 3.2.0
    • Fix Version/s: None
    • Component/s: SQL
    • Labels: None

    Description

      In HDFS, a distributed storage system, spaces and other special characters are allowed in paths, for example:

      hdfs://ns1/tmp2/hive-staging/hadoop_hive_2020-07-06_17-31-29_139_7042265710400397740-1/-ext-10000/test_table=2020-06-17 18%3A00%3A00/part-00000-84396c4e-ba05-4936-afc7-db46c4251bfa.c000
      
      

      When we load data using any of

      org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat
      org.apache.spark.sql.execution.datasources.orc.OrcFileFormat
      org.apache.spark.sql.hive.orc.OrcFileFormat

      an exception like the following may be thrown:

      Caused by: java.net.URISyntaxException: Illegal character in path at index 136: hdfs://ns1/tmp2/hive-staging/hadoop_hive_2020-07-06_17-31-29_139_7042265710400397740-1/-ext-10000/test_table=2020-06-17 18%3A00%3A00/part-00000-84396c4e-ba05-4936-afc7-db46c4251bfa.c000
      at java.net.URI$Parser.fail(URI.java:2848)
      at java.net.URI$Parser.checkChars(URI.java:3021)
      at java.net.URI$Parser.parseHierarchical(URI.java:3105)
      at java.net.URI$Parser.parse(URI.java:3053)
      at java.net.URI.<init>(URI.java:588)
      at org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$$anonfun$buildReaderWithPartitionValues$1.apply(ParquetFileFormat.scala:356)
      at org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$$anonfun$buildReaderWithPartitionValues$1.apply(ParquetFileFormat.scala:352)
      at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.org$apache$spark$sql$execution$datasources$FileScanRDD$$anon$$readCurrentFile(FileScanRDD.scala:124)
      at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:177)
      at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:101)
      at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:252)
      at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:250)
      at org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1394)
      at org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:256)
      ... 10 more
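
      The top frames of the trace are plain `java.net.URI` parsing, so the failure can be reproduced without Spark or Hadoop. The following is a minimal sketch using only the JDK; the class name `StrictUriDemo`, the helper `firstIllegalIndex`, and the shortened sample path are made up for illustration and are not part of Spark.

```java
import java.net.URI;
import java.net.URISyntaxException;

final class StrictUriDemo {
    // Hypothetical helper: returns the index reported by URISyntaxException,
    // or -1 when the single-argument URI constructor accepts the string.
    static int firstIllegalIndex(String s) {
        try {
            new URI(s); // strict parse: a raw space is not a legal URI character
            return -1;
        } catch (URISyntaxException e) {
            return e.getIndex();
        }
    }

    public static void main(String[] args) {
        // Shortened, made-up path with the same shape as the one in this issue:
        // a raw space inside a partition directory name.
        String p = "hdfs://ns1/tmp/dt=2020-06-17 18%3A00%3A00/part-00000";
        int i = firstIllegalIndex(p);
        System.out.println("illegal character index: " + i);
    }
}
```

      The `%3A` sequences are accepted as valid escapes; it is the unescaped space that the strict parser rejects.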
      
      

      Hadoop's Path class provides several constructors for building a path:

      https://github.com/apache/hadoop/blob/master/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/fs/Path.java

      We could fall back to constructing the Path from a String rather than from a URI.
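
      As a rough illustration of why a String-based construction tolerates such paths, the sketch below uses only the JDK: it splits the string by hand and calls the multi-argument `java.net.URI` constructor, which quotes illegal characters instead of rejecting them. This is an analogy under stated assumptions, not the Spark patch. `LenientUriDemo` and `lenientUri` are hypothetical names, and the hand-rolled parsing only handles strings of the form `scheme://authority/path`. In Spark itself the proposal would amount to passing the string to Hadoop's `Path(String)` constructor.

```java
import java.net.URI;
import java.net.URISyntaxException;

final class LenientUriDemo {
    // Hypothetical helper: builds a URI from "scheme://authority/path" by
    // splitting the string manually, so the path component is quoted rather
    // than validated. Assumes the input contains "://".
    static URI lenientUri(String s) throws URISyntaxException {
        int schemeEnd = s.indexOf("://");
        if (schemeEnd < 0) {
            throw new IllegalArgumentException("expected scheme://authority/path: " + s);
        }
        String scheme = s.substring(0, schemeEnd);
        int pathStart = s.indexOf('/', schemeEnd + 3);
        String authority = pathStart < 0 ? s.substring(schemeEnd + 3)
                                         : s.substring(schemeEnd + 3, pathStart);
        String path = pathStart < 0 ? "" : s.substring(pathStart);
        // The multi-argument constructor quotes illegal characters such as
        // spaces, and always quotes '%', so the decoded path round-trips.
        return new URI(scheme, authority, path, null, null);
    }

    public static void main(String[] args) throws URISyntaxException {
        String p = "hdfs://ns1/tmp/dt=2020-06-17 18%3A00%3A00/part-00000";
        URI u = lenientUri(p);
        System.out.println("raw path:     " + u.getRawPath());
        System.out.println("decoded path: " + u.getPath());
    }
}
```

      With this approach the decoded path keeps the literal directory name, including the space and the `%3A` text, which matches how the directory is actually named on HDFS.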

          People

            Assignee: Unassigned
            Reporter: southernriver chenliang