Uploaded image for project: 'Cassandra'
  1. Cassandra
  2. CASSANDRA-15774

Improve range reads to query by endpoints instead of vnodes to reduce number of remote requests

    XMLWordPrintableJSON

Details

    • Performance
    • Challenging
    • All
    • None

    Description

      Currently, range read queries in batches, see StorageProxy.RangeCommandIterator#sendNextRequests(). For each batch, it computes a list of merged vnode ranges up to concurrency factor and query each merged vnode range asynchronously. (note: consecutive vnode ranges can be merged if they share enough replicas to satisfy consistency level requirement)

      This works fine in general, but when concurrency factor is high because returned row count is small comparing to query limit or index filtering is used, coordinator may send too many concurrent remote range requests in a batch.

      We can improve it by grouping remote range requests by endpoints where each endpoint will return response corresponding to multiple non-consecutive ranges. With endpoint grouping, number of remote range requests should largely reduced and it's always capped by number of nodes in the cluster instead of number of ranges which is capped by concurrency factor.

      Let's look at an example on a 5-node cluster with 10 ranges(a,b,c,d,e,f,g,h,i,h) and rf3.

      Following is the range to replica mapping using round robin that should work well with consecutive range merger (consecutive range merger doesn't work well with fully random replica mapping, because it's less likely to have overlapping replicas for consecutive ranges)

         range-a replicas: 1, 2, 3
         range-b replicas: 2, 3, 4
         range-c replicas: 3, 4, 5
         range-d replicas: 1, 4, 5
         range-e replicas: 1, 2, 5
         range-f replicas: 1, 2, 3
         range-g replicas: 2, 3, 4
         range-h replicas: 3, 4, 5
         range-i replicas: 1, 4, 5
         range-j replicas: 1, 2, 5
      

      With default range read implementation and consecutive range merger, we need 10 replica read requests(2 for each merged range) for quorum:

           range (a,b] on node [2, 3]
           range (c,d] on node [4, 5]
           range (e,f] on node [1, 2]
           range (g,h] on node [3, 4]
           range (i,j] on node [1, 5]
      

      With group query by endpoints, we only need 4 replica read requests for quorum:

          * node 1: a, d, e, f, i, j
          * node 2: a, b, e, f, g, j
          * node 3: b, c, g, h
          * node 4: c, d, h, i
      

       
      Note that there are some complexities around short-read protection which needs to know whether replica has more rows available for current range.
       

      Attachments

        Activity

          People

            Unassigned Unassigned
            jasonstack Zhao Yang
            Votes:
            1 Vote for this issue
            Watchers:
            5 Start watching this issue

            Dates

              Created:
              Updated: