Details
- Type: Bug
- Status: Resolved
- Priority: Minor
- Resolution: Incomplete
- Affects Version: 2.3.0
- Fix Version: None
Description
Currently, when aliasing a column in PySpark, I lose its metadata:
print("just select = ", df.select(col("v")).schema.fields[0].metadata.keys())
print("select alias= ", df.select(col("v").alias("vv")).schema.fields[0].metadata.keys())
gives:
just select = dict_keys(['ml_attr'])
select alias= dict_keys([])
Looking at the alias() documentation, I see that metadata is an optional parameter. But when it is not set, alias() should not clear the existing metadata; the default behavior should be to keep it as-is.
Otherwise it causes problems later in the processing pipeline, when downstream code depends on that metadata being present.