Uploaded image for project: 'IMPALA'
  1. IMPALA
  2. IMPALA-12995

Don't always sort data for Iceberg partitioned inserts

    XMLWordPrintableJSON

Details

    • Bug
    • Status: Open
    • Major
    • Resolution: Unresolved
    • None
    • None
    • Frontend
    • ghx-label-1

    Description

      Currently we always do a SHUFFLE and SORT before inserting data into a partitioned Iceberg table.
      With SHUFFLE, we can guarantee that each partition is assigned to a single sink operator, hence we can minimize the number of files being created.
      With SORT, we can write partitions one after the other, therefore we only need to write one file at a time (and only buffer data for that one file). This way we can avoid out of memory situations in the sink.

      SORT does a total ordering of the incoming records, therefore it can be very expensive. And when SORT needs to spill to disk, its fragment cannot receive incoming records, blocking the execution of the whole query cluster-wide (because after some time every sender is blocked on the receiving SORT fragment).

      In a lot of cases the SORT is not necessary because there's only a very few partitions assigned to each sink, especially in large clusters with high MT_DOP.

      During planning Impala should decide whether to add the SORT node for such partitioned INSERTs. Also, it should respect the optimizer hints CLUSTERED/NOCLUSTERED.
      https://impala.apache.org/docs/build/html/topics/impala_hints.html

      Attachments

        Issue Links

          Activity

            People

              Unassigned Unassigned
              boroknagyz Zoltán Borók-Nagy
              Votes:
              0 Vote for this issue
              Watchers:
              3 Start watching this issue

              Dates

                Created:
                Updated: