SPARK-34750

Parquet with invalid chars on column name reads double as null when a clean schema is applied

    Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 2.4.3
    • Fix Version/s: None
    • Component/s: Input/Output
    • Labels:
      None
    • Environment:

      Pyspark 2.4.3

      AWS Glue Dev Endpoint EMR

    • Target Version/s:

      Description

      I have a Parquet file whose column names contain invalid characters (reference: https://issues.apache.org/jira/browse/SPARK-27442). Here is the file: Invalid Header Parquet.

      I tried to load this file with:

      df = glue_context.read.parquet('invalid_columns_double.parquet')
      df = df.withColumnRenamed('COL 1', 'COL_1')
      df = df.withColumnRenamed('COL,2', 'COL_2')
      df = df.withColumnRenamed('COL;3', 'COL_3')

      and so on.
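
      The renames can also be generated instead of written one by one. The following is a minimal sketch, not part of the original report: it assumes the invalid characters are exactly the ones listed in the exception message (space, comma, semicolon, braces, parentheses, newline, tab, equals sign), and `sanitize_column_name` and `rename_plan` are hypothetical helper names. It only produces the old/new name pairs and does not address the null values described later in this report.

```python
import re

# Characters listed in the AnalysisException message: space , ; { } ( ) \n \t =
_INVALID_CHARS = re.compile(r'[ ,;{}()\n\t=]')


def sanitize_column_name(name, replacement='_'):
    """Replace each invalid character in a column name (hypothetical helper)."""
    return _INVALID_CHARS.sub(replacement, name)


def rename_plan(columns):
    """Return (old, new) pairs for the columns whose names would change."""
    plan = []
    for old in columns:
        new = sanitize_column_name(old)
        if new != old:
            plan.append((old, new))
    return plan


plan = rename_plan(['COL 1', 'COL,2', 'COL;3', 'COL_4'])
for old, new in plan:
    print(old, '->', new)
    # With a DataFrame in hand: df = df.withColumnRenamed(old, new)
```

      In a Spark session, `rename_plan(df.columns)` would give the pairs to pass to `withColumnRenamed`. Two different source names can map to the same sanitized name, so the result should be checked for duplicates before it is applied.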

      Now if I call

      df.show()

      it throws an exception that still refers to the old column name:

      pyspark.sql.utils.AnalysisException: 'Attribute name "COL 1" contains invalid character(s) among " ,;{}()\\n\\t=". Please use alias to rename it.;'

       

      Some blogs I read suggested re-reading the same Parquet file with the new schema applied, so I did:

      df = glue_context.read.schema(df.schema).parquet('invalid_columns_double.parquet')

       

      This runs without an exception, but every value in the resulting DataFrame is null. The same approach works correctly for string columns.

       


              People

              • Assignee:
                Unassigned
                Reporter:
                toocoolblue2000 Nivas Umapathy
                Shepherd:
                liancheng