Spark / SPARK-34750

Parquet with invalid chars on column name reads double as null when a clean schema is applied


Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 2.4.3
    • Fix Version/s: None
    • Component/s: Input/Output
    • Labels: None
    • Environment: PySpark 2.4.3, AWS Glue Dev Endpoint (EMR)

    Description

      I have a Parquet file whose data has invalid characters in its column names (see SPARK-27442: https://issues.apache.org/jira/browse/SPARK-27442). The file is attached as "Invalid Header Parquet".

      I tried to load this file with

      df = glue_context.read.parquet('invalid_columns_double.parquet')

      df = df.withColumnRenamed('COL 1', 'COL_1')

      df = df.withColumnRenamed('COL,2', 'COL_2')

      df = df.withColumnRenamed('COL;3', 'COL_3') 

      and so on.
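      The manual renames above can be expressed as a loop. A minimal sketch (plain Python, no Spark session needed): `clean_name` is a hypothetical helper that replaces each character Spark's Parquet reader rejects with an underscore.

      ```python
      import re

      # Characters Spark rejects in Parquet column names: " ,;{}()\n\t="
      # (the set quoted in the AnalysisException below).
      INVALID = re.compile(r'[ ,;{}()\n\t=]')

      def clean_name(name: str) -> str:
          """Replace every invalid character with an underscore."""
          return INVALID.sub('_', name)

      # Applied to the raw header names from the report:
      print([clean_name(c) for c in ['COL 1', 'COL,2', 'COL;3']])
      # ['COL_1', 'COL_2', 'COL_3']
      ```

      With a DataFrame in hand, the same helper drives the renames: `for c in df.columns: df = df.withColumnRenamed(c, clean_name(c))`.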

      Now if I call

      df.show()

      it throws an exception that still refers to the old column name:

       pyspark.sql.utils.AnalysisException: 'Attribute name "COL 1" contains invalid character(s) among " ,;{}()\n\t=". Please use alias to rename it.;'

       

      Some blog posts suggested re-reading the same Parquet file with the new schema applied, so I did

      df = glue_context.read.schema(df.schema).parquet('invalid_columns_double.parquet')

       

      and it works, but all the double values in the DataFrame come back as null. The same approach works correctly for string columns.
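      The re-read workaround can be sketched as follows. `sanitized_schema` is a hypothetical helper (not part of the Spark API) that copies a schema while replacing the rejected characters, so the schema handed to `spark.read.schema()` is equivalent to `df.schema` after the renames above. Building a `StructType` needs no running Spark session:

      ```python
      import re
      from pyspark.sql.types import StructType, StructField, DoubleType

      def sanitized_schema(schema):
          """Copy `schema`, replacing characters Parquet rejects with '_'."""
          return StructType([
              StructField(re.sub(r'[ ,;{}()\n\t=]', '_', f.name),
                          f.dataType, f.nullable)
              for f in schema.fields
          ])

      # Hypothetical schema mirroring the attached file's headers:
      dirty = StructType([
          StructField('COL 1', DoubleType()),
          StructField('COL,2', DoubleType()),
      ])
      clean = sanitized_schema(dirty)
      print([f.name for f in clean.fields])
      # ['COL_1', 'COL_2']
      ```

      With a session available, the re-read would then be `glue_context.read.schema(clean).parquet('invalid_columns_double.parquet')`; per this report, double columns still come back null in 2.4.3 even with the clean schema applied.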

       

      Attachments

        Invalid Header Parquet

      People

        Assignee: Unassigned
        Reporter: Nivas Umapathy (toocoolblue2000)
        liancheng
        Votes: 0
        Watchers: 1
