[PHOENIX-2126] Improving performance of merge sort by multi-threaded and minheap implementation - ASF JIRA

Details

Type: Improvement
Status: Patch Available
Priority: Major
Resolution: Unresolved
Affects Version/s: 4.1.0, 4.2.0
Fix Version/s: None
Component/s: None
Labels: None

Description

CREATE TABLE IF NOT EXISTS test (
    dim1 INTEGER NOT NULL,
    A.B INTEGER,
    A.M DECIMAL,
    CONSTRAINT PK PRIMARY KEY (dim1)
)
SALT_BUCKETS = 256, DEFAULT_COLUMN_FAMILY = 'A';

Query to benchmark:

select dim1,sum(b),sum(m) from test where Datemth>=201505 and Datemth<=201505 and dim1 IS NOT NULL  group by dim1 order by sum(m) desc nulls last limit 10;

Current scenario:

*CASE 1:* dim1 is a high-cardinality attribute (10K+ distinct values) and the table has 256 salt buckets. The query above returns 256 iterators to the client, and MergeSortRowKeyResultIterator has to merge them on a single thread. If each iterator returns 10K tuples, the merge sort has to process about 2.56M tuples (256 x 10K). That is costly on a single thread, and the query spends most of its time on the client.

*CASE 2:* dim1 is a high-cardinality attribute (10K+ distinct values) and the table has 1 salt bucket. The query above returns a single iterator to the client, so MergeSortRowKeyResultIterator has nothing to merge. A single thread is fine here.

*CASE 3:* dim1 is a low-cardinality attribute (10-100 distinct values) and the table has 256 salt buckets. The query above again returns 256 iterators to the client, and MergeSortRowKeyResultIterator has to merge them on a single thread. A single thread is fine here as well, because it has to merge only a few thousand tuples (2,560 when each iterator returns 10).

Solution for the Case 1 problem:

The merge of n sorted iterators (each with m tuples) is reimplemented with a min heap. Scanning all n iterator heads for every one of the nm output tuples costs O(n^2 m); keeping the heads in a min heap reduces this to O(nm log n), because each heapify step takes O(log n) time.
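
To make the heap-based merge concrete, here is a minimal sketch in plain Java. It is not the code from the attached patches: the class name HeapMergeSketch, the Head wrapper, and the use of java.util.Iterator in place of Phoenix's result iterators are all assumptions made for illustration. The heap never holds more than n entries (one per iterator), so each of the nm tuples costs one poll and at most one add, both O(log n).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.Iterator;
import java.util.List;
import java.util.PriorityQueue;

// Hypothetical sketch of a k-way merge with a min heap; not Phoenix code.
public class HeapMergeSketch {

    /** Pairs an iterator with its current head so the heap can order iterators by head value. */
    private static final class Head<T> {
        final T value;
        final Iterator<T> source;

        Head(T value, Iterator<T> source) {
            this.value = value;
            this.source = source;
        }
    }

    /** Merges already-sorted iterators into one sorted list using a heap of at most n entries. */
    public static <T> List<T> merge(List<Iterator<T>> sortedIterators, Comparator<? super T> order) {
        PriorityQueue<Head<T>> heap = new PriorityQueue<>(
                Math.max(1, sortedIterators.size()),
                (a, b) -> order.compare(a.value, b.value));

        // Seed the heap with the first tuple of every non-empty iterator.
        for (Iterator<T> it : sortedIterators) {
            if (it.hasNext()) {
                heap.add(new Head<>(it.next(), it));
            }
        }

        List<T> merged = new ArrayList<>();
        while (!heap.isEmpty()) {
            // Take the smallest head, then refill the heap from the same iterator.
            Head<T> smallest = heap.poll();
            merged.add(smallest.value);
            if (smallest.source.hasNext()) {
                heap.add(new Head<>(smallest.source.next(), smallest.source));
            }
        }
        return merged;
    }

    public static void main(String[] args) {
        List<Iterator<Integer>> inputs = new ArrayList<>();
        inputs.add(Arrays.asList(1, 4, 9).iterator());
        inputs.add(Arrays.asList(2, 3, 10).iterator());
        inputs.add(Arrays.asList(5, 6, 7).iterator());
        // prints [1, 2, 3, 4, 5, 6, 7, 9, 10]
        System.out.println(merge(inputs, Comparator.<Integer>naturalOrder()));
    }
}
```

The sketch collects the output into a list for simplicity; a streaming variant would expose the same poll-and-refill step behind a next() call instead.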

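In addition, multiple threads (t) each process a group of iterators, which further reduces the merge time to approximately

T_parallel(nm) = T(nm)/t + T(t)

where T(nm) is the single-threaded merge cost and T(t) accounts for the final merge across the t per-thread results.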
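
The two-level idea can be sketched as follows. This is an illustration under stated assumptions, not the patch itself: TwoLevelMergeSketch, heapMerge, and parallelMerge are hypothetical names, the iterators are plain java.util.Iterator instances, and each worker materializes its partial result into a list, which a real client-side iterator would likely avoid. Each of up to t workers heap-merges its own group of iterators, and the calling thread then heap-merges the partial results.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.Iterator;
import java.util.List;
import java.util.PriorityQueue;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical sketch of a parallel two-level heap merge; not Phoenix code.
public class TwoLevelMergeSketch {

    private static final class Head<T> {
        final T value;
        final Iterator<T> source;

        Head(T value, Iterator<T> source) {
            this.value = value;
            this.source = source;
        }
    }

    /** Single-threaded min-heap merge of already-sorted iterators. */
    static <T> List<T> heapMerge(List<Iterator<T>> sorted, Comparator<? super T> order) {
        PriorityQueue<Head<T>> heap = new PriorityQueue<>(
                Math.max(1, sorted.size()),
                (a, b) -> order.compare(a.value, b.value));
        for (Iterator<T> it : sorted) {
            if (it.hasNext()) {
                heap.add(new Head<>(it.next(), it));
            }
        }
        List<T> merged = new ArrayList<>();
        while (!heap.isEmpty()) {
            Head<T> smallest = heap.poll();
            merged.add(smallest.value);
            if (smallest.source.hasNext()) {
                heap.add(new Head<>(smallest.source.next(), smallest.source));
            }
        }
        return merged;
    }

    /** First level: one heap merge per group, in parallel. Second level: merge the partial results. */
    public static <T> List<T> parallelMerge(List<Iterator<T>> sorted, Comparator<? super T> order, int threads) {
        if (threads < 1) {
            throw new IllegalArgumentException("threads must be >= 1");
        }
        int groups = Math.max(1, Math.min(threads, sorted.size()));
        ExecutorService pool = Executors.newFixedThreadPool(groups);
        try {
            List<Future<List<T>>> partials = new ArrayList<>();
            for (int g = 0; g < groups; g++) {
                // Group g takes iterators g, g + groups, g + 2 * groups, ...
                List<Iterator<T>> group = new ArrayList<>();
                for (int i = g; i < sorted.size(); i += groups) {
                    group.add(sorted.get(i));
                }
                Callable<List<T>> task = () -> heapMerge(group, order);
                partials.add(pool.submit(task));
            }
            List<Iterator<T>> secondLevel = new ArrayList<>();
            for (Future<List<T>> partial : partials) {
                secondLevel.add(partial.get().iterator());
            }
            return heapMerge(secondLevel, order);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while merging", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("merge task failed", e.getCause());
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        List<Iterator<Integer>> inputs = new ArrayList<>();
        inputs.add(Arrays.asList(1, 5).iterator());
        inputs.add(Arrays.asList(2, 6).iterator());
        inputs.add(Arrays.asList(3, 7).iterator());
        inputs.add(Arrays.asList(4, 8).iterator());
        // prints [1, 2, 3, 4, 5, 6, 7, 8]
        System.out.println(parallelMerge(inputs, Comparator.<Integer>naturalOrder(), 2));
    }
}
```

Whether the extra threads pay off depends on n and m: for the Case 3 workload the thread-pool overhead would probably outweigh the gain, which matches the observation that a single thread is fine there.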

Attachments

PHOENIX-2126_v1.0.patch	17/Jul/15 22:31	28 kB	Ankit Singhal
PHOENIX-2126_v2.0.patch	17/Aug/15 14:55	32 kB	Ankit Singhal
PHOENIX-2126_v3.patch	24/Oct/15 10:03	33 kB	Ankit Singhal

Sub-Tasks

1.	Use a priority queue in MergeSortResultIterator	Resolved	Ankit Singhal
2.	Use your parallelized two level heap implementation to perform a client-side sort	Open	Ankit Singhal
3.	Fix test failures due to use of priority queue for merge sort	Resolved	Ankit Singhal

People

Assignee: Ankit Singhal

Reporter: Ankit Singhal

Votes: 1

Watchers: 8

Dates

Created: 17/Jul/15 21:49

Updated: 11/Nov/15 00:54