Uploaded image for project: 'Spark'
  1. Spark
  2. SPARK-14044

Allow configuration of DynamicPartitionWriterContainer#writeRows to bypass sort step

    XMLWordPrintableJSON

    Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Duplicate
    • Affects Version/s: 1.6.1
    • Fix Version/s: None
    • Component/s: Spark Core
    • Labels:
      None

      Description

      It would be very useful to allow the disabling of this block of code within DynamicPartitionWriterContainer#writeRows at runtime:

      https://github.com/apache/spark/blob/8ef3399aff04bf8b7ab294c0f55bcf195995842b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/WriterContainer.scala#L379-L418

      The use case is that an upstream groupBy has already sorted a great many fine grained groups which are the target of the partitionBy. This partitionBy shares the same keys as the groupBy. Currently, we can't even get Spark to succeed due to the sort step and data skew in the partitions. In general, this would make more efficient use of cluster resources.

      For this to work, there needs to be a way to communicate between the groupBy and the partitionBy by way of some runtime configuration. This is very similar in function to Hive's hive.optimize.sort.dynamic.partition parameter.

        Attachments

          Issue Links

            Activity

              People

              • Assignee:
                Unassigned
                Reporter:
                btiernay Bob Tiernay
              • Votes:
                0 Vote for this issue
                Watchers:
                2 Start watching this issue

                Dates

                • Created:
                  Updated:
                  Resolved: