  1. IMPALA
  2. IMPALA-4969

With clustered hint, consider sort->exchange->insert plan

Details

    • Type: Improvement
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: Impala 2.9.0
    • Fix Version/s: None
    • Component/s: Frontend
    • Labels: None

    Description

      I noticed that with the clustered hint, we do the SORT right before the insert, which puts it after the exchange (when shuffling).

      For a simple ETL transformation (insert into tbl select * from src_tbl), the number of hosts doing the write is going to be less than or equal to the number of hosts doing the scan. So, by doing the sort after the exchange, there's a risk of losing parallelism.
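
      To make the parallelism argument concrete, here is a toy model (not Impala code). The function name and parameters are hypothetical. It assumes the exchange spreads rows evenly over the write hosts, and that with a sort before the exchange the write hosts only need to merge already-sorted streams.

```python
def max_sort_load(rows_per_scan_host, num_write_hosts, sort_before_exchange):
    """Largest number of rows any single host has to sort (toy model)."""
    if sort_before_exchange:
        # sort -> exchange -> insert: every scan host sorts only its own rows.
        return max(rows_per_scan_host)
    # exchange -> sort -> insert: all rows land on the write hosts first.
    # Assumption: the exchange distributes them evenly (no skew).
    total = sum(rows_per_scan_host)
    return -(-total // num_write_hosts)  # ceiling division


# 10 scan hosts with 1M rows each, but only 2 hosts doing the write.
scan = [1_000_000] * 10
after_exchange = max_sort_load(scan, 2, sort_before_exchange=False)
before_exchange = max_sort_load(scan, 2, sort_before_exchange=True)
```

      In this example the busiest host sorts 5,000,000 rows when the sort runs after the exchange, versus 1,000,000 rows when it runs before.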

      Using TPC-DS as an example, the Impala TPC-DS toolkit requires an ETL step where we re-partition the fact table according to the sales date. Sales date is skewed: some dates have a lot more data than others. Also, there are only 4k sales dates, so the data size might not even out across the whole cluster.
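
      The skew concern can be sketched with synthetic numbers. The distribution below is made up for illustration and is not actual TPC-DS data; the helper names are hypothetical, and `date_id % num_nodes` is only a stand-in for whatever hash partitioning the exchange uses.

```python
def rows_per_node(rows_per_date, num_nodes):
    """Sum the rows each node receives when rows are partitioned by date."""
    totals = [0] * num_nodes
    for date_id, rows in enumerate(rows_per_date):
        # Stand-in for hash partitioning on the sales date.
        totals[date_id % num_nodes] += rows
    return totals


def skew_ratio(totals):
    """Rows on the busiest node divided by the per-node average."""
    return max(totals) / (sum(totals) / len(totals))


# Hypothetical data: 4,000 dates, where date i has 1_000_000 // (i + 1) rows.
skewed = [1_000_000 // (i + 1) for i in range(4000)]
uniform = [1_000] * 4000

skewed_ratio = skew_ratio(rows_per_node(skewed, 20))
uniform_ratio = skew_ratio(rows_per_node(uniform, 20))
```

      With evenly sized dates the ratio is 1.0, while with the skewed input the busiest of the 20 nodes receives more than twice the average. A sort placed after the exchange inherits that imbalance.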

            People

              Assignee: Unassigned
              Reporter: Alan Choi (alan@cloudera.com)
              Votes: 0
              Watchers: 4

              Dates

                Created:
                Updated: