[SPARK-32208] Spark SQL throw Illegal character exception when load certain abnormal path of HDFS - ASF JIRA

XML

Word

Printable

JSON

Details

Type: Improvement
Status: In Progress
Priority: Major
Resolution: Unresolved
Affects Version/s: 2.4.3, 3.2.0
Fix Version/s: None
Component/s: SQL
Labels:
None

Description

In the distributed hdfs storage system，Space and other special character are allowed in the path：

hdfs://ns1/tmp2/hive-staging/hadoop_hive_2020-07-06_17-31-29_139_7042265710400397740-1/-ext-10000/test_table=2020-06-17 18%3A00%3A00/part-00000-84396c4e-ba05-4936-afc7-db46c4251bfa.c000

When we load data by using

org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat
org.apache.spark.sql.execution.datasources.orcOrcFileFormat.scala
org.apache.spark.sql.hive.orc.OrcFileFormat

, exception may throw as below:

Caused by: java.net.URISyntaxException: Illegal character in path at index 136: hdfs://ns1/tmp2/hive-staging/hadoop_hive_2020-07-06_17-31-29_139_7042265710400397740-1/-ext-10000/test_table=2020-06-17 18%3A00%3A00/part-00000-84396c4e-ba05-4936-afc7-db46c4251bfa.c000
at java.net.URI$Parser.fail(URI.java:2848)
at java.net.URI$Parser.checkChars(URI.java:3021)
at java.net.URI$Parser.parseHierarchical(URI.java:3105)
at java.net.URI$Parser.parse(URI.java:3053)
at java.net.URI.<init>(URI.java:588)
at org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat
anonfun$buildReaderWithPartitionValues$1.apply(ParquetFileFormat.scala:356)atorg.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat
anonfunbuildReaderWithPartitionValues1.apply(ParquetFileFormat.scala:352)
at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.orgapachesparksqlexecutiondatasourcesFileScanRDD
anon
readCurrentFile(FileScanRDD.scala:124)
at org.apache.spark.sql.execution.datasources.FileScanRDD
anon$1.nextIterator(FileScanRDD.scala:177)atorg.apache.spark.sql.execution.datasources.FileScanRDD
anon1.hasNext(FileScanRDD.scala:101)atorg.apache.spark.sql.execution.datasources.FileFormatWriteranonfunorgapachesparksqlexecutiondatasourcesFileFormatWriter
executeTask$3.apply(FileFormatWriter.scala:252)atorg.apache.spark.sql.execution.datasources.FileFormatWriter
anonfunorgapachesparksqlexecutiondatasourcesFileFormatWriterexecuteTask3.apply(FileFormatWriter.scala:250)
at org.apache.spark.util.Utils.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1394)atorg.apache.spark.sql.execution.datasources.FileFormatWriter.orgapachesparksqlexecutiondatasourcesFileFormatWriter$$executeTask(FileFormatWriter.scala:256)
... 10 more

Hdfs has provided serveral construct function to build path:

https://github.com/apache/hadoop/blob/master/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/fs/Path.java

We could fall back to construct a path from a String rather than URI.

Attachments

Issue Links

links to

[Github] Pull Request #30707 (southernriver)

Activity

People

Assignee:: Unassigned

Reporter:: chenliang

Votes:: 0 Vote for this issue

Watchers:: 2 Start watching this issue

Dates

Created:: 07/Jul/20 08:53

Updated:: 12/Dec/22 18:11