  1. Spark
  2. SPARK-31794

Incorrect distribution with repartitionByRange and repartition column expression


Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 2.3.2, 2.4.5, 3.0.1
    • Fix Version/s: None
    • Component/s: Spark Core

    Description

      Both repartitionByRange and repartition(<num>, <column>) distribute rows unevenly across the resulting partitions.

      With repartitionByRange, one partition holds about 2x the average volume and the last partition is empty. With repartition the imbalance is worse: some partitions hold roughly 2x to 3x the average (the reported counts peak at 125 rows against an average of about 42), and many partitions are empty.

      This imbalance can cause performance problems in a concurrent environment.

      Details from testing in 3 different versions.

      Metric | Version 2.3.2 | Version 2.4.5 | Version 3.0 Preview2
      Spark Version | 2.3.2.3.1.4.0-315 | 2.4.5 | 3.0.0-preview2
      Default Partition Length | 2 | 2 | 80
      Default Partition getNumPartitions | 2 | 2 | 80
      Default Partition groupBy spark_partition_id | 200 | 200 | 200
      repartitionByRange Length | 24 | 24 | 24
      repartitionByRange getNumPartitions | 24 | 24 | 24
      repartitionByRange groupBy spark_partition_id | 200 | 200 | 200
      repartition by column expr Length | 24 | 24 | 24
      repartition by column expr getNumPartitions | 24 | 24 | 24
      repartition by column expr groupBy spark_partition_id | 200 | 200 | 200

      Per-partition row counts (identical in all three versions):

      repartitionByRange: List(83, 42, 42, 42, 42, 42, 42, 42, 42, 42, 42, 42, 42, 42, 42, 42, 42, 41, 41, 41, 41, 41, 41, 0)
      repartition by column expr: List(83, 42, 0, 84, 0, 42, 125, 0, 42, 84, 0, 42, 0, 82, 0, 124, 42, 83, 84, 42, 0, 0, 0, 0)
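To make the imbalance in the reported counts easier to compare, here is a minimal Python sketch that summarizes the two lists above. It does not run Spark; it only does arithmetic on the numbers copied from the report. The helper name skew_summary is hypothetical and not part of any Spark API.

```python
from statistics import mean

# Per-partition row counts copied verbatim from the report
# (the same lists were observed in all three Spark versions).
range_counts = [83, 42, 42, 42, 42, 42, 42, 42, 42, 42, 42, 42,
                42, 42, 42, 42, 42, 41, 41, 41, 41, 41, 41, 0]
expr_counts = [83, 42, 0, 84, 0, 42, 125, 0, 42, 84, 0, 42,
               0, 82, 0, 124, 42, 83, 84, 42, 0, 0, 0, 0]


def skew_summary(counts):
    """Summarize how unevenly rows are spread over partitions.

    Hypothetical helper for illustration only.
    """
    avg = mean(counts)
    return {
        "partitions": len(counts),
        "rows": sum(counts),
        "empty": sum(1 for c in counts if c == 0),
        "max": max(counts),
        "max_over_avg": round(max(counts) / avg, 2),
    }


for name, counts in [("repartitionByRange", range_counts),
                     ("repartition by column expr", expr_counts)]:
    print(name, skew_summary(counts))
```

On these inputs both lists add up to 1001 rows over 24 partitions. The range-partitioned list has 1 empty partition and a largest partition of about 2x the average; the column-expression list has 10 empty partitions and a largest partition of about 3x the average.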

          People

            Assignee: Unassigned
            Reporter: Ramesha Bhatta (rbhatta)
            Votes: 0
            Watchers: 4