Details
Bug
Status: Resolved
Blocker
Resolution: Fixed
4.0.0
Description
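We allow users to repartition by a map column. This leads to incorrect results: maps that are equal but were built with a different insertion order can end up in different partitions.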
// Create a sequence of maps that all have the same elements, but a different insertion order.
import scala.util.Random
import org.apache.spark.sql.functions.spark_partition_id
val elements = Seq.tabulate(4)(i => i -> s"v$i")
// A map with 4 elements uses a scala.collection.immutable.Map$Map4, which retains the insertion order.
val maps = Seq.fill(10)(Random.shuffle(elements).toMap)

// Check that they are all the same in Scala land.
assert(maps.distinct.size == 1)

// This fails, which is good.
maps.toDF.distinct.show()

// This should return a single partition. However, it returns multiple partitions.
// Note: count() is needed after groupBy to produce the `count` column shown below.
maps.toDF.repartition(4, $"value").groupBy(spark_partition_id()).count().show()
// +--------------------+-----+
// |SPARK_PARTITION_ID()|count|
// +--------------------+-----+
// |                   0|    2|
// |                   1|    4|
// |                   2|    2|
// |                   3|    2|
// +--------------------+-----+
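The underlying effect can be illustrated without Spark. The following is a minimal sketch, not Spark's actual hashing code: it uses `scala.util.hashing.MurmurHash3.orderedHash` as a stand-in for any hash that folds over map entries in iteration order, and the names `MapOrderHashSketch`, `orderedHash`, and `normalizedHash` are hypothetical. It shows that two equal maps can iterate in different orders, and that sorting the entries by key before hashing is one possible way to make the hash independent of insertion order.

```scala
import scala.util.hashing.MurmurHash3

object MapOrderHashSketch {
  // Order-sensitive hash (assumption: stand-in for a hash that walks entries in stored order).
  def orderedHash(m: Map[Int, String]): Int =
    MurmurHash3.orderedHash(m.toSeq)

  // Normalized hash: sort entries by key first, so insertion order no longer matters.
  def normalizedHash(m: Map[Int, String]): Int =
    MurmurHash3.orderedHash(m.toSeq.sortBy(_._1))

  def main(args: Array[String]): Unit = {
    // Same entries, opposite insertion order. Small immutable maps keep insertion order.
    val a = Map(0 -> "v0", 1 -> "v1", 2 -> "v2", 3 -> "v3")
    val b = Map(3 -> "v3", 2 -> "v2", 1 -> "v1", 0 -> "v0")

    println(s"equal as maps:        ${a == b}")
    println(s"same iteration order: ${a.toSeq == b.toSeq}")
    // The order-sensitive hashes will typically differ, even though the maps are equal.
    println(s"ordered hashes:       ${orderedHash(a)} vs ${orderedHash(b)}")
    // The normalized hashes always agree for equal maps.
    println(s"normalized hashes:    ${normalizedHash(a)} vs ${normalizedHash(b)}")
  }
}
```

If partition assignment is derived from an order-sensitive hash like the first one, equal maps can be routed to different partitions, which matches the output above.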