Uploaded image for project: 'Spark'
  1. Spark
  2. SPARK-24940

Coalesce and Repartition Hint for SQL Queries

    XMLWordPrintableJSON

Details

    • Improvement
    • Status: Resolved
    • Major
    • Resolution: Fixed
    • 2.1.1
    • 2.4.0
    • SQL
    • None

    Description

      Many Spark SQL users in my company have asked for a way to control the number of output files in Spark SQL. The users prefer not to use function repartition(n) or coalesce(n) that require them to write and deploy Scala/Java/Python code.
        
      The DataFrame API has repartition/coalesce for a long time. However, we do not have an equivalent functionality in SQL queries. We propose adding the following Hive-style Coalesce and Repartition Hint to Spark SQL.

      INSERT ... SELECT /*+ COALESCE(numPartitions) */ ...
      INSERT ... SELECT /*+ REPARTITION(numPartitions) */ ...
      

      Hint names are case insensitive.

      Coalesce Hint reduces the number of partitions. It only merges partitions thus minimizes the data movement.

      Repartition Hint can either increase or decrease the number of partitions. It performs full shuffle of data and ensures data is equally distributed.

      Repartition adds a new stage, so it does not affect the parallelism of the existing stage. In contrast, Coalesce does affect the parallelism of the existing stage since it does not add a new stage.

      Multiple Inserts Queries and Named Subqueries are also supported.

      Attachments

        Activity

          People

            jzhuge John Zhuge
            jzhuge John Zhuge
            Votes:
            0 Vote for this issue
            Watchers:
            12 Start watching this issue

            Dates

              Created:
              Updated:
              Resolved: