Spark / SPARK-34751

Parquet with invalid chars on column name reads double as null when a clean schema is applied


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Duplicate
    • Affects Version/s: 2.4.3, 3.1.1
    • Fix Version/s: None
    • Component/s: Input/Output
    • Labels: None
    • Environment: Pyspark 2.4.3, AWS Glue Dev Endpoint EMR

    Description

      I have a parquet file containing data with invalid characters in its column names. Reference: https://issues.apache.org/jira/browse/SPARK-27442. The file is attached to this ticket.

      I tried to load this file with 

      df = glue_context.read.parquet('invalid_columns_double.parquet')

      df = df.withColumnRenamed('COL 1', 'COL_1')

      df = df.withColumnRenamed('COL,2', 'COL_2')

      df = df.withColumnRenamed('COL;3', 'COL_3') 

      and so on.
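The per-column renames above can be generalized. A minimal sketch of a name-sanitizing helper (stdlib only; the `sanitize` function name is my own, not a Spark API), replacing each character that Parquet rejects with an underscore:

```python
import re

def sanitize(name: str) -> str:
    # Replace every character the Parquet error message lists as
    # invalid (" ,;{}()\n\t=") with an underscore.
    return re.sub(r'[ ,;{}()\n\t=]', '_', name)

print(sanitize("COL 1"))  # COL_1
print(sanitize("COL,2"))  # COL_2
print(sanitize("COL;3"))  # COL_3
```

In PySpark this could replace the chain of `withColumnRenamed` calls with a single `df.toDF(*[sanitize(c) for c in df.columns])`.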

      Now if I call

      df.show()

      it throws an exception that still refers to the old column name:

      pyspark.sql.utils.AnalysisException: 'Attribute name "COL 1" contains invalid character(s) among " ,;{}()\n\t=". Please use alias to rename it.;'

       

      When I read about this in some blogs, the suggestion was to re-read the same parquet file with the new schema applied, so I did:

      df = glue_context.read.schema(df.schema).parquet('invalid_columns_double.parquet')

      This no longer throws, but all the double values in the dataframe come back as null. The same approach works correctly for String columns.

       

      Attachments

        1. invalid_columns_double.parquet
          5 kB
          Nivas Umapathy

        Issue Links

          Activity

            People

              Assignee: Unassigned
              Reporter: Nivas Umapathy (toocoolblue2000)
              liancheng
              Votes: 0
              Watchers: 3
