  Kudu / KUDU-2670

Split Spark jobs into more tasks, and add more concurrency for scan operations



    • Type: Improvement
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 1.8.0
    • Fix Version/s: None
    • Component/s: java, spark
    • Labels:


      Refer to KUDU-2437, "Split a tablet into primary key ranges by size".

      We need a Java client implementation that supports splitting the tablet scan operation.

      We suggest two new implementations for the Java client:

      1. A ConcurrentKuduScanner that runs multiple scanners at the same time. This helps in one case in particular: the scan returns only one row, but the predicate does not contain the primary key, so we must send many scan requests to find that single row. Sending those requests one by one is slow, so we need a concurrent approach. In our tests on a 10 GB tablet, this saved a lot of time on a single machine.
      2. A way to split a Spark job into more tasks. To do so, we obtain scan tokens in two steps: first we ask the tserver for the tablet's primary key ranges, then we build more scan tokens from those ranges. In our usage a tablet is 10 GB, but we split the work so that each task processes only 1 GB of data, which gives better performance.
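The concurrent-scan idea in point 1 can be sketched as a fan-out over key ranges. This is a minimal illustration, not the actual Kudu API: the class and method names (ConcurrentScanSketch, scanRange, scanAll) are hypothetical, and scanRange() stands in for "open a KuduScanner over one range and drain it".

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class ConcurrentScanSketch {

    // Stand-in for opening a KuduScanner over one key range and
    // draining it; a real implementation would return RowResults.
    static List<String> scanRange(int rangeId) {
        return List.of("row-from-range-" + rangeId);
    }

    // Instead of issuing one scan request at a time, submit all the
    // range scans to a bounded thread pool and collect rows as they
    // complete. Submission order is preserved in the result.
    static List<String> scanAll(List<Integer> rangeIds, int parallelism)
            throws InterruptedException, ExecutionException {
        ExecutorService pool = Executors.newFixedThreadPool(parallelism);
        try {
            List<Future<List<String>>> futures = new ArrayList<>();
            for (int id : rangeIds) {
                futures.add(pool.submit(() -> scanRange(id)));
            }
            List<String> rows = new ArrayList<>();
            for (Future<List<String>> f : futures) {
                rows.addAll(f.get()); // propagates any scan failure
            }
            return rows;
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        // Four ranges scanned by two worker threads.
        List<String> rows = scanAll(List.of(1, 2, 3, 4), 2);
        System.out.println(rows.size()); // prints 4
    }
}
```

The win comes from overlapping the per-request round trips to the tserver: for a predicate that matches one row in one range, the other ranges' empty scans no longer serialize behind it.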

      Both features have been running well for us for half a year. We hope they will be useful to the community.
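The splitting arithmetic behind point 2 can be sketched as follows. This is an illustration under simplifying assumptions: keys are modeled as longs (real Kudu primary keys are encoded byte arrays), and the SplitTokensSketch/KeyRange names are hypothetical. In the proposal, the tserver first reports the tablet's key range and estimated size (per KUDU-2437), then the client cuts that range into sub-ranges of roughly the target size, backing one scan token (and hence one Spark task) each.

```java
import java.util.ArrayList;
import java.util.List;

public class SplitTokensSketch {

    // One half-open key range [start, end) that would back one scan token.
    record KeyRange(long start, long end) {}

    // Cut [startKey, endKey) into enough evenly sized sub-ranges that
    // each covers roughly targetBytes of a tablet of tabletBytes total.
    static List<KeyRange> split(long startKey, long endKey,
                                long tabletBytes, long targetBytes) {
        // Round up so no chunk exceeds the target size.
        int chunks = (int) Math.max(1, (tabletBytes + targetBytes - 1) / targetBytes);
        long span = endKey - startKey;
        List<KeyRange> ranges = new ArrayList<>();
        for (int i = 0; i < chunks; i++) {
            long lo = startKey + span * i / chunks;
            long hi = startKey + span * (i + 1) / chunks;
            ranges.add(new KeyRange(lo, hi));
        }
        return ranges;
    }

    public static void main(String[] args) {
        // A 10 GB tablet with a 1 GB target yields 10 sub-ranges,
        // i.e. 10 Spark tasks instead of 1.
        List<KeyRange> ranges =
                split(0, 1000, 10_000_000_000L, 1_000_000_000L);
        System.out.println(ranges.size()); // prints 10
    }
}
```

For reference, later Kudu releases exposed a similar knob natively (a split-size-bytes setting on the scan token builder and in kudu-spark), though availability depends on the version in use.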


              • Assignee:
                oclarms Xu Yao
                yangz yangz
              • Votes: 0
              • Watchers: 7

                • Created: