  1. Spark
  2. SPARK-31841

Dataset.repartition leverage adaptive execution

    Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Minor
    • Resolution: Duplicate
    • Affects Version/s: 3.0.0
    • Fix Version/s: None
    • Component/s: SQL
    • Labels:
      None
    • Environment:

      Spark branch-3.0 as of May 1 this year

      Description

      hello,

      we are very happy users of adaptive query execution. it's a great feature to not have to think about and tune the number of partitions in a shuffle anymore.

      i noticed that Dataset.groupBy consistently uses adaptive execution when it's enabled (e.g. i don't see the default 200 partitions), but when i do Dataset.repartition it seems i am back to a hardcoded number of partitions.

      Should adaptive execution also be used for repartition? It would be nice to be able to repartition without having to think about the optimal number of partitions.

      An example:

      $ spark-shell --conf spark.sql.adaptive.enabled=true --conf spark.sql.adaptive.advisoryPartitionSizeInBytes=100000
      Welcome to
            ____              __
           / __/__  ___ _____/ /__
          _\ \/ _ \/ _ `/ __/  '_/
         /___/ .__/\_,_/_/ /_/\_\   version 3.0.0-SNAPSHOT
            /_/
               
      Using Scala version 2.12.10 (OpenJDK 64-Bit Server VM, Java 1.8.0_252)
      Type in expressions to have them evaluated.
      Type :help for more information.
      scala> val x = (1 to 1000000).toDF
      x: org.apache.spark.sql.DataFrame = [value: int]
      scala> x.rdd.getNumPartitions
      res0: Int = 2
      scala> x.repartition($"value").rdd.getNumPartitions
      res1: Int = 200
      scala> x.groupBy("value").count.rdd.getNumPartitions
      res2: Int = 67
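
For context on why the groupBy result above is not 200: adaptive execution can merge adjacent small shuffle partitions until they approach spark.sql.adaptive.advisoryPartitionSizeInBytes. The sketch below is a simplified, plain-Scala illustration of that idea only. It is not Spark's implementation, the object and method names (CoalesceSketch, coalesce) are hypothetical, and the partition sizes are made up rather than measured from the session above.

```scala
// Simplified illustration only: NOT Spark's actual coalescing code.
object CoalesceSketch {

  /** Greedily merges adjacent partitions so that each merged group stays
    * at or below advisorySize where possible. A single partition that is
    * already larger than advisorySize is kept as its own group.
    * Returns the total size in bytes of each merged group.
    */
  def coalesce(partitionSizes: Seq[Long], advisorySize: Long): Seq[Long] = {
    val merged = scala.collection.mutable.ArrayBuffer.empty[Long]
    var current = 0L        // bytes accumulated in the group being built
    var hasCurrent = false  // whether the current group holds any partition
    for (size <- partitionSizes) {
      // Close the current group if adding this partition would exceed the target.
      if (hasCurrent && current + size > advisorySize) {
        merged += current
        current = 0L
      }
      current += size
      hasCurrent = true
    }
    if (hasCurrent) merged += current
    merged.toSeq
  }

  def main(args: Array[String]): Unit = {
    // Hypothetical input: 200 initial shuffle partitions of 40,000 bytes each.
    val sizes = Seq.fill(200)(40000L)
    val result = coalesce(sizes, advisorySize = 100000L)
    // prints "200 partitions coalesced into 100"
    println(s"${sizes.length} partitions coalesced into ${result.length}")
  }
}
```

With these made-up sizes, pairs of partitions fit under the 100000-byte target but triples do not, so 200 partitions become 100. The request in this issue is for Dataset.repartition without an explicit partition count to get the same kind of treatment.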
      

            People

            • Assignee: Unassigned
            • Reporter: koert kuipers
