  1. Spark
  2. SPARK-19256 Hive bucketing write support
  3. SPARK-26164

[SQL] Allow FileFormatWriter to write multiple partitions/buckets without sort


Details

    • Type: Sub-task
    • Status: Resolved
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: 2.4.0, 3.0.0, 3.1.0
    • Fix Version/s: 3.2.0
    • Component/s: SQL
    • Labels: None

    Description

      Problem:

      Spark currently always requires a local sort on the partition/bucket columns before writing to an output table [1]. The disadvantage is that the sort may spill, which wastes reserved CPU time on the executor. Hive does not require a local sort before writing to an output table [2], and we saw a performance regression when migrating Hive workloads to Spark.

       

      Proposal:

      We can avoid the local sort by keeping a mapping from file path to output writer. When a row must be written to a new file path, we create a new output writer; otherwise, we reuse the writer that already exists for that path (the main change should be in FileFormatDataWriter.scala). This is very similar to what Hive does in [2].
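
The path-to-writer mapping described above can be sketched as follows. This is a minimal illustration in Python, not Spark's implementation: `InMemoryWriter`, `write_without_sort`, and the `path_for_row` callback are hypothetical names introduced here, and the writer only collects rows in memory instead of writing a file.

```python
from typing import Callable, Dict, Iterable, List


class InMemoryWriter:
    """Hypothetical stand-in for an output writer bound to one file path."""

    def __init__(self, path: str) -> None:
        self.path = path
        self.rows: List[dict] = []
        self.closed = False

    def write(self, row: dict) -> None:
        self.rows.append(row)

    def close(self) -> None:
        self.closed = True


def write_without_sort(
    rows: Iterable[dict],
    path_for_row: Callable[[dict], str],
) -> Dict[str, InMemoryWriter]:
    """Write unsorted rows, keeping one open writer per output path."""
    writers: Dict[str, InMemoryWriter] = {}
    try:
        for row in rows:
            path = path_for_row(row)
            writer = writers.get(path)
            if writer is None:
                # First row for this path: create a new writer and remember it.
                writer = InMemoryWriter(path)
                writers[path] = writer
            # Otherwise the existing writer for this path is reused.
            writer.write(row)
    finally:
        # All writers stay open until the input is exhausted, then are closed.
        for writer in writers.values():
            writer.close()
    return writers


# Rows arrive unsorted with respect to the partition column "dt".
rows = [
    {"dt": "2018-11-01", "id": 1},
    {"dt": "2018-11-02", "id": 2},
    {"dt": "2018-11-01", "id": 3},
]
result = write_without_sort(rows, lambda r: f"dt={r['dt']}/part-00000")
```

Every distinct path keeps its writer open until the end of the task, which is the memory cost the next paragraph discusses.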

      The new behavior (avoiding the sort by keeping multiple output writers) consumes more executor memory than the current behavior (only one output writer open at a time), because multiple output writers must be open at the same time. We can therefore add a config to switch between the current and new behavior.
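
To make the trade-off concrete, here is a small Python sketch of a switch between the two behaviors. It is an illustration under assumptions, not Spark code: the flag `keep_multiple_writers` and the function `write_rows` are hypothetical (the issue does not name the config), and "writers" are modeled as in-memory lists so that the peak number of simultaneously open writers can be counted.

```python
from typing import Callable, Dict, Hashable, Iterable, List, Tuple

Row = Tuple[int, str]  # (bucket id, payload); a hypothetical row shape


def write_rows(
    rows: Iterable[Row],
    key_for_row: Callable[[Row], Hashable],
    keep_multiple_writers: bool,
) -> Tuple[Dict[Hashable, List[Row]], int]:
    """Return (rows grouped per output key, peak number of writers open at once)."""
    outputs: Dict[Hashable, List[Row]] = {}
    if keep_multiple_writers:
        # New behavior: no sort; each key's writer stays open until the end,
        # so the peak equals the number of distinct keys seen.
        for row in rows:
            outputs.setdefault(key_for_row(row), []).append(row)
        return outputs, len(outputs)

    # Current behavior: local sort first, so rows for one key are contiguous
    # and the previous writer can be closed before the next one is opened.
    peak = 0
    for row in sorted(rows, key=key_for_row):
        key = key_for_row(row)
        if key not in outputs:
            outputs[key] = []
            peak = 1  # never more than one writer open at a time
        outputs[key].append(row)
    return outputs, peak


rows = [(2, "a"), (0, "b"), (1, "c"), (0, "d"), (2, "e")]
new_out, new_peak = write_rows(rows, lambda r: r[0], keep_multiple_writers=True)
old_out, old_peak = write_rows(rows, lambda r: r[0], keep_multiple_writers=False)
```

Both modes produce the same grouped output; they differ only in whether the cost is paid as a sort or as additional writers held open at once.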

       

      [1]: Spark FileFormatWriter.scala - https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileFormatWriter.scala#L123

      [2]: Hive FileSinkOperator.java - https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/exec/FileSinkOperator.java#L510

       

       


          People

            Assignee: Cheng Su (chengsu)
            Reporter: Cheng Su (chengsu)
            Votes: 1
            Watchers: 8
