Uploaded image for project: 'Spark'
  1. Spark
  2. SPARK-14044

Allow configuration of DynamicPartitionWriterContainer#writeRows to bypass sort step

    XMLWordPrintableJSON

Details

    • Improvement
    • Status: Resolved
    • Major
    • Resolution: Duplicate
    • 1.6.1
    • None
    • Spark Core
    • None

    Description

      It would be very useful to allow the disabling of this block of code within DynamicPartitionWriterContainer#writeRows at runtime:

      https://github.com/apache/spark/blob/8ef3399aff04bf8b7ab294c0f55bcf195995842b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/WriterContainer.scala#L379-L418

      The use case is that an upstream groupBy has already sorted a great many fine grained groups which are the target of the partitionBy. This partitionBy shares the same keys as the groupBy. Currently, we can't even get Spark to succeed due to the sort step and data skew in the partitions. In general, this would make more efficient use of cluster resources.

      For this to work, there needs to be a way to communicate between the groupBy and the partitionBy by way of some runtime configuration. This is very similar in function to Hive's hive.optimize.sort.dynamic.partition parameter.

      Attachments

        Issue Links

          Activity

            People

              Unassigned Unassigned
              btiernay Bob Tiernay
              Votes:
              0 Vote for this issue
              Watchers:
              2 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved: