SPARK-50525

Do not allow repartition by map

    Description

       

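      Spark currently allows users to repartition a DataFrame by a map column. This leads to incorrect results: maps that are logically equal but differ in entry order can end up in different partitions.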

      // In spark-shell these imports are already in scope; they are listed here for completeness.
      import scala.util.Random
      import org.apache.spark.sql.functions.spark_partition_id
      import spark.implicits._
      
      // Create a sequence of maps that all contain the same elements, but in a different insertion order.
      // A map with 4 elements is a scala.collection.immutable.Map$Map4, which retains insertion order.
      val elements = Seq.tabulate(4)(i => i -> s"v$i")
      val maps = Seq.fill(10)(Random.shuffle(elements).toMap)
      
      // Check that they are all equal in Scala land.
      assert(maps.distinct.size == 1)
      
      // This fails with an analysis error, which is good: Spark refuses to deduplicate on a map column.
      maps.toDF.distinct.show()
      
      // All rows should land in a single partition. However, they are spread across multiple partitions.
      maps.toDF.repartition(4, $"value").groupBy(spark_partition_id()).count().show()
      
      // +--------------------+-----+
      // |SPARK_PARTITION_ID()|count|
      // +--------------------+-----+
      // |                   0|    2|
      // |                   1|    4|
      // |                   2|    2|
      // |                   3|    2|
      // +--------------------+-----+
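
The ordering effect behind the reproduction above can be seen without Spark. The following is a minimal sketch in plain Scala; `MapOrderDemo` and `canonical` are hypothetical names introduced only for illustration. It shows that two small maps with the same entries compare equal while iterating in different orders, which is presumably what an order-sensitive hash over the entries would observe.

```scala
object MapOrderDemo {
  // The same four entries, inserted in opposite orders.
  val elements: Seq[(Int, String)] = Seq.tabulate(4)(i => i -> s"v$i")
  val forward: Map[Int, String] = elements.toMap
  val backward: Map[Int, String] = elements.reverse.toMap

  // Hypothetical canonical form: the entries sorted by key.
  def canonical(m: Map[Int, String]): Seq[(Int, String)] =
    m.toSeq.sortBy(_._1)

  def main(args: Array[String]): Unit = {
    // Map equality ignores insertion order.
    println(forward == backward)                       // true
    // Iteration order follows insertion order for maps of up to 4 entries.
    println(forward.toSeq == backward.toSeq)           // false
    // Sorting the entries removes the order dependence.
    println(canonical(forward) == canonical(backward)) // true
  }
}
```

Any partitioning scheme that consumes the entries in iteration order would need a canonical form like the sorted sequence above to treat equal maps identically.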

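If repartitioning by the contents of a map is still needed, one possible workaround is to partition by an order-independent representation instead of the map itself. The sketch below is not part of this issue's resolution; it assumes a local `SparkSession`, and the object and method names (`CanonicalMapRepartition`, `partitionCounts`) are hypothetical. It uses the built-in `map_entries` and `sort_array` functions to turn each map into a sorted array of key/value structs before repartitioning.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{map_entries, sort_array, spark_partition_id}
import scala.util.Random

object CanonicalMapRepartition {
  // Returns the row count of every non-empty partition after repartitioning
  // by a sorted array of the map's entries (an assumed workaround).
  def partitionCounts(spark: SparkSession): Seq[Long] = {
    import spark.implicits._

    val elements = Seq.tabulate(4)(i => i -> s"v$i")
    val maps = Seq.fill(10)(Random.shuffle(elements).toMap)

    // Sorting the entries gives equal maps an identical partitioning key.
    val canonicalKey = sort_array(map_entries($"value"))

    maps.toDF
      .repartition(4, canonicalKey)
      .groupBy(spark_partition_id())
      .count()
      .collect()
      .map(_.getLong(1))
      .toSeq
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[2]")
      .appName("canonical-map-repartition-sketch")
      .getOrCreate()
    try println(partitionCounts(spark))
    finally spark.stop()
  }
}
```

Because every row carries the same sorted entry array, all ten rows are expected to land in one partition. The sorting adds per-row cost, so this is only worthwhile when the partitioning key really has to be derived from a map.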
       

       

          People

            ostronaut Dmytro Tsyliuryk
            hvanhovell Herman van Hövell