Many Spark SQL users in my company have asked for a way to control the number of output files in Spark SQL. The users prefer not to use function repartition(n) or coalesce(n) that require them to write and deploy Scala/Java/Python code.
The DataFrame API has repartition/coalesce for a long time. However, we do not have an equivalent functionality in SQL queries. We propose adding the following Hive-style Coalesce and Repartition Hint to Spark SQL.
INSERT ... SELECT /*+ COALESCE(numPartitions) */ ... INSERT ... SELECT /*+ REPARTITION(numPartitions) */ ...
Hint names are case insensitive.
Coalesce Hint reduces the number of partitions. It only merges partitions thus minimizes the data movement.
Repartition Hint can either increase or decrease the number of partitions. It performs full shuffle of data and ensures data is equally distributed.
Repartition adds a new stage, so it does not affect the parallelism of the existing stage. In contrast, Coalesce does affect the parallelism of the existing stage since it does not add a new stage.
Multiple Inserts Queries and Named Subqueries are also supported.