Details
-
Improvement
-
Status: Closed
-
Major
-
Resolution: Fixed
-
None
-
None
Description
Merging bloom filters in semijoin reduction can become the main bottleneck in case of large number of source mapper tasks (~1000, Map 1 in below example) and a large amount of expected entries (50M) in bloom filters.
For example in TPCDS Q93:
select /*+ semi(store_returns, sr_item_sk, store_sales, 70000000)*/ ss_customer_sk ,sum(act_sales) sumsales from (select ss_item_sk ,ss_ticket_number ,ss_customer_sk ,case when sr_return_quantity is not null then (ss_quantity-sr_return_quantity)*ss_sales_price else (ss_quantity*ss_sales_price) end act_sales from store_sales left outer join store_returns on (sr_item_sk = ss_item_sk and sr_ticket_number = ss_ticket_number) ,reason where sr_reason_sk = r_reason_sk and r_reason_desc = 'reason 66') t group by ss_customer_sk order by sumsales, ss_customer_sk limit 100;
On 10TB-30TB scale there is a chance that from 3-4 mins of query runtime 1-2 mins are spent with merging bloom filters (Reducer 2), as in: lipwig-output3605036885489193068.svg
---------------------------------------------------------------------------------------------- VERTICES MODE STATUS TOTAL COMPLETED RUNNING PENDING FAILED KILLED ---------------------------------------------------------------------------------------------- Map 3 .......... llap SUCCEEDED 1 1 0 0 0 0 Map 1 .......... llap SUCCEEDED 1263 1263 0 0 0 0 Reducer 2 llap RUNNING 1 0 1 0 0 0 Map 4 llap RUNNING 6154 0 207 5947 0 0 Reducer 5 llap INITED 43 0 0 43 0 0 Reducer 6 llap INITED 1 0 0 1 0 0 ---------------------------------------------------------------------------------------------- VERTICES: 02/06 [====>>----------------------] 16% ELAPSED TIME: 149.98 s ----------------------------------------------------------------------------------------------
For example, 70M entries in bloom filter leads to a 436 465 696 bits, so merging 1263 bloom filters means running ~ 1263 * 436 465 696 bitwise OR operation, which is very hot codepath, but can be parallelized.
Attachments
Attachments
Issue Links
- causes
-
HIVE-26655 VectorUDAFBloomFilterMerge should take care of safe batch handling when working in parallel
- Closed
- links to