Details
- Type: Bug
- Status: Resolved
- Priority: Major
- Resolution: Duplicate
- Affects Version/s: 2.4.3, 3.1.1
- Fix Version/s: None
- Component/s: None
- Environment: PySpark 2.4.3, AWS Glue Dev Endpoint (EMR)
Description
I have a parquet file whose columns have invalid characters in their names (reference: https://issues.apache.org/jira/browse/SPARK-27442). The file is attached to this ticket.
I tried to load the file and rename the columns with:
df = glue_context.read.parquet('invalid_columns_double.parquet')
df = df.withColumnRenamed('COL 1', 'COL_1')
df = df.withColumnRenamed('COL,2', 'COL_2')
df = df.withColumnRenamed('COL;3', 'COL_3')
and so on for the remaining columns.
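The one-rename-per-column chain above can also be generated programmatically. A minimal sketch in plain Python, replacing every character Spark's Parquet support rejects (the set quoted in the AnalysisException below) with an underscore; `sanitize` is a hypothetical helper name, not part of the ticket:

```python
# Characters rejected in Parquet attribute names, as quoted in the
# AnalysisException: " ,;{}()\n\t="
INVALID_CHARS = ' ,;{}()\n\t='

def sanitize(name):
    """Replace each invalid character in a column name with an underscore."""
    return ''.join('_' if c in INVALID_CHARS else c for c in name)

# Applied to the dataframe from the snippet above:
# for old in df.columns:
#     df = df.withColumnRenamed(old, sanitize(old))
```

This only changes how the rename list is built; it hits the same exception described below, since the logical plan still references the original file schema.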
Now if I call
df.show()
it throws this exception, which still points to the old column name:
pyspark.sql.utils.AnalysisException: 'Attribute name "COL 1" contains invalid character(s) among " ,;{}()\n\t=". Please use alias to rename it.;'
Some blog posts suggested re-reading the same parquet file with the new schema applied, so I tried:
df = glue_context.read.schema(df.schema).parquet('invalid_columns_double.parquet')
This works, but every value in the double columns comes back as null. The same approach works correctly for string columns.
Attachments
Issue Links
- duplicates SPARK-34750: Parquet with invalid chars on column name reads double as null when a clean schema is applied (Open)