SPARK-24347

df.alias() in python API should not clear metadata by default


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Minor
    • Resolution: Incomplete
    • Affects Version/s: 2.3.0
    • Fix Version/s: None
    • Component/s: PySpark

    Description

      Currently, aliasing a column in PySpark loses its metadata:

      print("just select = ", df.select(col("v")).schema.fields[0].metadata.keys())
      print("select alias= ", df.select(col("v").alias("vv")).schema.fields[0].metadata.keys())

      gives:

      just select =  dict_keys(['ml_attr'])
      select alias=  dict_keys([])

      Looking at the alias() documentation, I see that metadata is an optional parameter, but leaving it unset should not clear the existing metadata. The default behavior should be to keep it as-is.

      Otherwise it causes problems later in the processing pipeline, when downstream code depends on the metadata.
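      A minimal plain-Python sketch of the proposed default behavior (this models the semantics only; alias_field and the dict-based schema fields are illustrative, not the actual PySpark implementation):

```python
def alias_field(field, new_name, metadata=None):
    """Rename a schema field; preserve its metadata unless the caller
    explicitly supplies a replacement (the behavior this issue proposes)."""
    return {
        "name": new_name,
        "metadata": field["metadata"] if metadata is None else metadata,
    }

v = {"name": "v", "metadata": {"ml_attr": {"type": "numeric"}}}

kept = alias_field(v, "vv")                  # no metadata arg: carried over
cleared = alias_field(v, "vv", metadata={})  # caller explicitly opted out

print("just alias  =", kept["metadata"].keys())     # dict_keys(['ml_attr'])
print("cleared     =", cleared["metadata"].keys())  # dict_keys([])
```

      In the meantime, callers can work around the bug by passing the original field's metadata explicitly, e.g. col("v").alias("vv", metadata=df.schema["v"].metadata).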



          People

            Assignee: Unassigned
            Reporter: Tomasz Bartczak (kretes)
            Votes: 0
            Watchers: 4
