  Spark / SPARK-34805

PySpark loses metadata in DataFrame fields when selecting nested columns

Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 3.0.1, 3.1.1
    • Fix Version/s: 3.3.0
    • Component/s: PySpark
    • Labels: None

    Description

      When a DataFrame schema contains nested StructTypes whose fields carry metadata, that metadata is lost when the nested fields are selected. For example, suppose

      df.schema.fields[0].dataType.fields[0].metadata
      

      returns a non-empty dictionary, then

      df.select('Field0.SubField0').schema.fields[0].metadata

      returns an empty dictionary, where "Field0" is the name of the first field in the DataFrame and "SubField0" is the name of the first nested field under "Field0".

       

      Attachments

        1. jsonMetadataTest.py
          2 kB
          Mark Ressler
        2. nested_columns_metadata.scala
          0.8 kB
          Kevin Wallimann

        Activity

          People

            Assignee: Unassigned
            Reporter: Mark Ressler
            Votes: 4
            Watchers: 6

            Dates

              Created:
              Updated:
              Resolved: