Spark > SPARK-36200 Improve schema check logic before execution > SPARK-27442

ParquetFileFormat fails to read column named with invalid characters


Details

    • Type: Sub-task
    • Status: Resolved
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: 2.0.0, 2.4.1
    • Fix Version/s: 3.3.0
    • Component/s: Input/Output
    • Labels: None

    Description

      When reading a Parquet file whose column names contain characters considered invalid, the reader fails with an exception:

      Name: org.apache.spark.sql.AnalysisException
      Message: Attribute name "..." contains invalid character(s) among " ,;{}()\n\t=". Please use alias to rename it.

      Spark should not be able to write such files, but it should be able to read them (and allow the user to correct the names). However, the possible workarounds (such as using an alias to rename the column, or forcing another schema) do not work, because the check is performed on the input schema before any renaming can take effect.

      (Possible fix: remove the superfluous call to ParquetWriteSupport.setSchema(requiredSchema, hadoopConf) from buildReaderWithPartitionValues?)
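      To illustrate the behaviour described above, here is a minimal Python sketch of the field-name validation, reconstructed from the error message in this issue (the forbidden character set " ,;{}()\n\t=" comes straight from the AnalysisException text). This is not Spark's actual code; the real check lives in the Scala Parquet write support and, as this issue notes, is also applied on the read path, which is why the read fails before any alias can rename the column:

      ```python
      # Characters rejected in Parquet field names, as listed in the
      # AnalysisException message quoted in this issue's description.
      INVALID_CHARS = set(" ,;{}()\n\t=")

      def check_field_name(name: str) -> None:
          """Sketch of the validation: raise if the column name contains
          a character the Parquet schema converter rejects."""
          if any(c in INVALID_CHARS for c in name):
              raise ValueError(
                  f'Attribute name "{name}" contains invalid character(s) '
                  'among " ,;{}()\\n\\t=". Please use alias to rename it.'
              )

      # A clean name passes the check:
      check_field_name("price")

      # A name such as "col(1)" fails the check even when only reading,
      # so renaming it via an alias after the read never gets a chance:
      try:
          check_field_name("col(1)")
      except ValueError as e:
          print(e)
      ```

      Because the check runs against the input schema, moving it to the write path only (as the possible fix above suggests) would let such files be read and then repaired with an alias.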


          People

            Assignee: angerszhuuu (angerszhu)
            Reporter: vrs (Jan Vršovský)
            Votes: 0
            Watchers: 8
