[KUDU-2670] Splitting more tasks for spark job, and add more concurrent for scan operation - ASF JIRA

This builds on ~~KUDU-2437~~ (Split a tablet into primary key ranges by size).

We need a Java client implementation that supports splitting a tablet scan operation.

We propose two additions to the Java client:

1. A ConcurrentKuduScanner that runs several scanners at once so that data is read in parallel. This helps in one case in particular: a scan that returns only one row but whose predicate does not include the primary key. Such a scan sends many scan requests yet returns a single row, and sending those requests one by one is slow, so they need to be issued concurrently. In our tests on a 10G tablet, this saved a lot of time on a single machine.
2. A way to split a Spark job into more tasks. This requires obtaining scan tokens in two steps: first we ask the tserver for the key ranges, then we use those ranges to build more scan tokens. In our deployment each tablet is 10G, but each task processes only 1G of data, which gives us better performance.
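
The two ideas above can be sketched roughly as follows. This is only an illustration and not the proposed patch: it does not use the Kudu client at all. `SplitScanSketch`, `Chunk`, `splitBySize` and `scanConcurrently` are hypothetical names, byte offsets stand in for the primary key ranges that the tserver would return, and the "scan" is a fake function that finds a single matching row.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.ToLongFunction;

// Hypothetical sketch: none of these names come from the Kudu Java client.
class SplitScanSketch {

    /** Half-open range [start, end); a stand-in for a primary key range. */
    static final class Chunk {
        final long start;
        final long end;

        Chunk(long start, long end) {
            this.start = start;
            this.end = end;
        }
    }

    /** Step 1: cut a tablet of tabletBytes into chunks of at most splitBytes. */
    static List<Chunk> splitBySize(long tabletBytes, long splitBytes) {
        if (tabletBytes < 0 || splitBytes <= 0) {
            throw new IllegalArgumentException("sizes must be positive");
        }
        List<Chunk> chunks = new ArrayList<>();
        for (long start = 0; start < tabletBytes; start += splitBytes) {
            chunks.add(new Chunk(start, Math.min(start + splitBytes, tabletBytes)));
        }
        return chunks;
    }

    /** Step 2: scan every chunk on a thread pool and sum the returned row counts. */
    static long scanConcurrently(List<Chunk> chunks, int threads,
                                 ToLongFunction<Chunk> scanChunk) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<Long>> futures = new ArrayList<>();
            for (Chunk chunk : chunks) {
                Callable<Long> task = () -> scanChunk.applyAsLong(chunk);
                futures.add(pool.submit(task));
            }
            long totalRows = 0;
            for (Future<Long> future : futures) {
                totalRows += future.get(); // blocks until that chunk is done
            }
            return totalRows;
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        long gib = 1L << 30;
        // A 10G tablet processed in 1G pieces, as in the description.
        List<Chunk> chunks = splitBySize(10 * gib, gib);
        // Fake scan: exactly one matching row, located at offset 7.5G.
        long target = 7 * gib + gib / 2;
        long rows = scanConcurrently(chunks, 4,
                c -> (c.start <= target && target < c.end) ? 1L : 0L);
        System.out.println(chunks.size() + " chunks, " + rows + " matching row(s)");
        // prints "10 chunks, 1 matching row(s)"
    }
}
```

In a Spark job each chunk would become its own task rather than a thread pool entry; the thread pool here corresponds to the single-machine concurrent scanner case.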

These features have run well for us for half a year. We hope they will be useful to the community.

is depended upon by

KUDU-2785 Support more parallel scanners in the backup job

is related to

KUDU-2917 Split a tablet into primary key ranges by number of row

relates to

IMPALA-9792 Split Kudu scan ranges into smaller chunks for greater parallelism