Details
-
Bug
-
Status: Resolved
-
Major
-
Resolution: Fixed
-
3.1.2
-
None
Description
When I use the arrays_zip function on renamed columns, an unexpected schema is written to disk.
from pyspark.sql import Row, SparkSession
from pyspark.sql.functions import arrays_zip, col

spark = SparkSession.builder.getOrCreate()

data = [
    Row(a1=["a", "a"], b1=["b", "b"]),
]

# Rename both columns, then zip the renamed columns into an array of structs.
df = (
    spark.sparkContext.parallelize(data).toDF()
    .withColumnRenamed("a1", "a2")
    .withColumnRenamed("b1", "b2")
    .withColumn("zipped", arrays_zip(col("a2"), col("b2")))
)

# In-memory schema: the struct fields use the renamed columns (a2, b2).
df.printSchema()
# root
#  |-- a2: array (nullable = true)
#  |    |-- element: string (containsNull = true)
#  |-- b2: array (nullable = true)
#  |    |-- element: string (containsNull = true)
#  |-- zipped: array (nullable = true)
#  |    |-- element: struct (containsNull = false)
#  |    |    |-- a2: string (nullable = true)
#  |    |    |-- b2: string (nullable = true)

df.write.save("test.parquet")

# Schema read back from disk: the struct fields use the old names (a1, b1).
spark.read.load("test.parquet").printSchema()
# root
#  |-- a2: array (nullable = true)
#  |    |-- element: string (containsNull = true)
#  |-- b2: array (nullable = true)
#  |    |-- element: string (containsNull = true)
#  |-- zipped: array (nullable = true)
#  |    |-- element: struct (containsNull = true)
#  |    |    |-- a1: string (nullable = true)
#  |    |    |-- b1: string (nullable = true)
I would expect the schema written to disk to match the one printed from the DataFrame. Instead, the struct fields inside the zipped column are written with the original column names (a1, b1) rather than the renamed ones (a2, b2).
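For anyone who wants to check for this mismatch programmatically, here is a minimal sketch of a comparison helper. It is not part of the original report: the names struct_field_names and zipped_field are hypothetical, and the two dictionaries are hand-written stand-ins that assume the JSON layout Spark uses for schemas, reduced to the zipped column from the output above.

```python
def struct_field_names(schema_json, column):
    """Return the field names of the struct elements in an array-of-struct column."""
    for field in schema_json["fields"]:
        if field["name"] == column:
            element = field["type"]["elementType"]
            return [f["name"] for f in element["fields"]]
    raise KeyError(column)


def zipped_field(names):
    """Hypothetical helper: build the schema entry for a 'zipped' column of string fields."""
    return {
        "name": "zipped",
        "nullable": True,
        "metadata": {},
        "type": {
            "type": "array",
            "containsNull": True,
            "elementType": {
                "type": "struct",
                "fields": [
                    {"name": n, "type": "string", "nullable": True, "metadata": {}}
                    for n in names
                ],
            },
        },
    }


# Hand-written stand-ins for the two schemas shown in the report.
in_memory = {"type": "struct", "fields": [zipped_field(["a2", "b2"])]}
read_back = {"type": "struct", "fields": [zipped_field(["a1", "b1"])]}

print(struct_field_names(in_memory, "zipped"))
print(struct_field_names(read_back, "zipped"))
print(struct_field_names(in_memory, "zipped") == struct_field_names(read_back, "zipped"))
```

With a live session, the same helper should accept df.schema.jsonValue() and spark.read.load("test.parquet").schema.jsonValue() in place of the stand-in dictionaries.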