Details
-
Improvement
-
Status: Open
-
Major
-
Resolution: Unresolved
-
Impala 2.9.0
-
None
-
None
Description
I noticed that with the clustered hint, we do the SORT right before the insert, but it's after the exchange (when shuffling).
For a simple ETL transformation (insert into tbl select * from src_tbl), the number of hosts doing the write is going to be less than or equal to the number of hosts doing the scan. So, by doing the sort after the exchange, there's a risk of losing parallelism.
Using TPC-DS as an example, the Impala TPC-DS toolkit requires an ETL step where we re-partition the fact table according to the sales date. Sales date is skewed: some dates have a lot more data than others. Also, there are only 4k sales dates, so the data size might not even out across the whole cluster.
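To make the skew concern concrete, here is a toy model in Python. It is only a sketch, not Impala's planner or runtime: the Zipf-like row counts, the 10-host cluster, and the use of `date_id % num_hosts` as a stand-in for the hash exchange are all assumptions made for illustration. It compares how many rows the busiest host would have to sort if the sort runs after the exchange (each date routed to one host) versus on the scan side (assuming scan ranges are spread evenly).

```python
def rows_per_date(num_dates, top_rows=1_000_000):
    """Hypothetical Zipf-like skew: date i holds top_rows // (i + 1) rows."""
    return [top_rows // (i + 1) for i in range(num_dates)]


def load_after_exchange(rows, num_hosts):
    """Rows each host sorts when every date is routed to a single host.

    date_id % num_hosts is an assumed stand-in for the hash partitioning
    done by the exchange.
    """
    load = [0] * num_hosts
    for date_id, n in enumerate(rows):
        load[date_id % num_hosts] += n
    return load


def load_before_exchange(rows, num_hosts):
    """Rows each host would sort on the scan side.

    Assumes scan ranges (and therefore rows) are spread evenly over hosts.
    """
    base, extra = divmod(sum(rows), num_hosts)
    return [base + (1 if host < extra else 0) for host in range(num_hosts)]


rows = rows_per_date(4000)           # roughly 4k distinct sales dates
after = load_after_exchange(rows, 10)
before = load_before_exchange(rows, 10)

print("max rows sorted on one host, sort after exchange: ", max(after))
print("max rows sorted on one host, sort before exchange:", max(before))
```

Under these assumptions the host that receives the largest date ends up sorting noticeably more than an even share, which is the imbalance described above. Real distributions and Impala's actual hash function will give different numbers.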
Issue Links
- relates to IMPALA-9951 Skew in analytic sorts when partition key has low cardinality (Open)