  1. IMPALA
  2. IMPALA-4969

With clustered hint, consider sort->exchange->insert plan

    Details

    • Type: Improvement
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: Impala 2.9.0
    • Fix Version/s: None
    • Component/s: Frontend
    • Labels:
      None

      Description

      I noticed that with the clustered hint, we do the SORT right before the insert, but after the exchange (when shuffling).

      For a simple ETL transformation (insert into tbl select * from src_tbl), the number of hosts doing the write is going to be less than or equal to the number of hosts doing the scan. So, by doing the sort after the exchange, there's a risk of losing parallelism.

      Using TPC-DS as an example, the Impala TPC-DS toolkit requires an ETL step where we re-partition the fact table according to the sales date. Sales date is skewed: some dates have a lot more data than others. Also, there are only 4k sales dates, so the data size might not even out across the whole cluster.
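
To make the parallelism concern concrete, here is a minimal toy model in Python. It is not Impala code. It only counts how many rows each host would have to sort under the two plan shapes. The function name, the integer keys, the row counts, and the `key % num_hosts` routing are all assumptions made for illustration, and the model ignores any cost of combining sorted streams on the receiving side.

```python
def rows_sorted_per_host(rows_per_key, num_hosts, sort_before_exchange):
    """Return the number of rows each host sorts in a toy model.

    rows_per_key: dict mapping an integer partition key to its row count.
    Assumptions (illustrative only): the scan spreads rows evenly over all
    hosts, and the exchange routes every row of a key to key % num_hosts.
    """
    if sort_before_exchange:
        # sort->exchange->insert: each scan host sorts its even share.
        total = sum(rows_per_key.values())
        base, extra = divmod(total, num_hosts)
        return [base + (1 if host < extra else 0) for host in range(num_hosts)]

    # exchange->sort->insert: each host sorts whatever keys land on it.
    load = [0] * num_hosts
    for key, count in rows_per_key.items():
        load[key % num_hosts] += count
    return load


# Hypothetical skewed input: four partition keys, one of them much larger.
rows_per_key = {0: 700, 1: 100, 2: 100, 3: 100}
after = rows_sorted_per_host(rows_per_key, num_hosts=8, sort_before_exchange=False)
before = rows_sorted_per_host(rows_per_key, num_hosts=8, sort_before_exchange=True)
print("sort after exchange: ", after)
print("sort before exchange:", before)
```

In this made-up input, sorting after the exchange leaves half of the hosts with nothing to sort while one host handles most of the rows; sorting before the exchange spreads the same rows evenly across all scan hosts.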

              People

              • Assignee:
                Unassigned
                Reporter:
                alan@cloudera.com Alan Choi