Spark / SPARK-34751

Parquet with invalid chars on column name reads double as null when a clean schema is applied


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Duplicate
    • Affects Version/s: 2.4.3, 3.1.1
    • Fix Version/s: None
    • Component/s: Input/Output
    • Labels: None
    • Environment: Pyspark 2.4.3, AWS Glue Dev Endpoint EMR

    Description

      I have a parquet file containing data with invalid characters in its column names. Reference: https://issues.apache.org/jira/browse/SPARK-27442. The file is attached to this ticket.

      I tried to load this file with 

      df = glue_context.read.parquet('invalid_columns_double.parquet')

      df = df.withColumnRenamed('COL 1', 'COL_1')

      df = df.withColumnRenamed('COL,2', 'COL_2')

      df = df.withColumnRenamed('COL;3', 'COL_3') 

      and so on.
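The per-column renames above can be generalized. A minimal sketch of a name-sanitizing helper (stdlib only; the `sanitize` function name is my own, not a Spark API), replacing each character that Parquet rejects with an underscore:

```python
import re

def sanitize(name: str) -> str:
    # Replace every character the Parquet error message lists as
    # invalid (" ,;{}()\n\t=") with an underscore.
    return re.sub(r'[ ,;{}()\n\t=]', '_', name)

print(sanitize("COL 1"))  # COL_1
print(sanitize("COL,2"))  # COL_2
print(sanitize("COL;3"))  # COL_3
```

In PySpark this could replace the chain of `withColumnRenamed` calls with a single `df.toDF(*[sanitize(c) for c in df.columns])`.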

      Now if I call

      df.show()

      it throws an exception that still refers to the old column name:

      pyspark.sql.utils.AnalysisException: 'Attribute name "COL 1" contains invalid character(s) among " ,;{}()\n\t=". Please use alias to rename it.;'

       

      When I read about this in some blogs, the suggestion was to re-read the same parquet file with the new schema applied, so I did:

      df = glue_context.read.schema(df.schema).parquet('invalid_columns_double.parquet')

      This no longer throws, but all the double values in the dataframe come back as null. The same approach works correctly for String columns.

       

      Attachments

        1. invalid_columns_double.parquet
          5 kB
          Nivas Umapathy

        Issue Links

          Activity

            People

              Assignee: Unassigned
              Reporter: Nivas Umapathy (toocoolblue2000)
              liancheng
              Votes: 0
              Watchers: 3
