Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Won't Fix
    • Fix Version/s: None
    • Component/s: Core
    • Labels:

      Description

      With large rows, it would be nice to not have to send an entire row if a small part is out of sync. Could we use the row index blocks as repair atoms instead of the full row?

        Issue Links

          Activity

          Jonathan Ellis created issue -
          Sylvain Lebresne added a comment -

          Right now we're using token ranges to create the merkle tree, so we cannot repair at a granularity finer than a token without major changes to repair.
          Besides, repair needs to use the same "atoms" on all the nodes it repairs, so I don't think the row index blocks would qualify, since they differ from node to node.

          Overall, I don't see how that can be done with the current repair.

          Jonathan Ellis added a comment -

          That's a pretty big ouch for wide-row data models. If you're doing tens of appends per second to one of those, the odds are pretty good that your merkle trees will be out of sync at any given instant, and you end up streaming the entire row.

          Sylvain Lebresne added a comment -

          For the record, I certainly don't pretend it's a good thing. I would even add that it will also be a problem in the scenario envisioned by CASSANDRA-1684.

          Jonathan Ellis made changes -
          Field Original Value New Value
          Priority Minor [ 4 ] Major [ 3 ]
          Sylvain Lebresne made changes -
          Assignee Sylvain Lebresne [ slebresne ]
          Sylvain Lebresne added a comment -

          Un-assigning myself for now, as I have no idea how to do this. As said previously, I'm skeptical that our current repair is compatible with this, so imo this ticket amounts to redoing repair pretty much completely.

          Jonathan Ellis added a comment -

          Notes from chat:

          When we repair [0, 1000], we agree on some depth for the merkle tree, say 2, and we say the merkle tree leaves will be [0, 250], [250, 500], [500, 750], [750, 1000].
          Then each node calculates the hash for those leaves based on its keys, and we compare.
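
          To make the leaf math concrete, here is a minimal sketch (plain Java with made-up names, not Cassandra's actual MerkleTree/Validator code) of splitting a range into power-of-2 leaves and folding each row's digest into its leaf:

            import java.util.*;

            class LeafHashSketch {
                // Split [left, right) into 2^depth contiguous leaf ranges.
                static long[] leafBoundaries(long left, long right, int depth) {
                    int leaves = 1 << depth;
                    long[] bounds = new long[leaves + 1];
                    for (int i = 0; i <= leaves; i++)
                        bounds[i] = left + (right - left) / leaves * i;
                    bounds[leaves] = right; // avoid losing the tail to integer rounding
                    return bounds;
                }

                // Fold each (token -> row digest) pair into the hash of its leaf.
                static long[] leafHashes(long[] bounds, SortedMap<Long, byte[]> rowDigests) {
                    long[] hashes = new long[bounds.length - 1];
                    for (Map.Entry<Long, byte[]> e : rowDigests.entrySet())
                        hashes[leafFor(bounds, e.getKey())] ^= 31L * Arrays.hashCode(e.getValue()) + e.getKey();
                    return hashes;
                }

                // Find the leaf whose range contains the token.
                static int leafFor(long[] bounds, long token) {
                    for (int i = 1; i < bounds.length; i++)
                        if (token < bounds[i]) return i - 1;
                    return bounds.length - 2;
                }
            }

          Each node computes its own leaf hashes and the leaves are compared pairwise; any leaf that differs is streamed in full, which is why a single hot wide row inside a leaf forces the whole row across the wire.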

          We could make it a two-step process, where everyone starts w/ the power-of-2 tree, but then A can say "I have row 10 with a billion columns, let's subdivide [0, 250] into [0, (10, 500000000)] and [(10, 500000000), 250]".

          The drawback then is that you will do a first validation pass to agree on the subdivisions, then another to compute the actual hashes.

          Or, we could first do a merkle tree as we do now, then for the ranges that differ, if we know they cover lots of columns (which can be computed easily initially), we could compute smaller hash ranges before streaming anything. You'd still read everything twice in the worst case, but if most rows are small then you don't need to read much the second time.
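
          A hedged sketch of that second pass (the names and the slicing scheme are illustrative assumptions, not an existing API): once a leaf mismatches and is known to cover many columns, the nodes would agree on slices inside the big row and hash/compare those before streaming anything:

            import java.util.*;

            class WideRowSubdivision {
                // Split the column positions [0, columnCount) of one wide row into
                // `pieces` slices; each slice would be hashed and compared separately,
                // so only mismatching slices get streamed rather than the whole row.
                static List<long[]> subdivideWideRow(long rowToken, long columnCount, int pieces) {
                    List<long[]> slices = new ArrayList<>();
                    long step = Math.max(1, (columnCount + pieces - 1) / pieces);
                    for (long start = 0; start < columnCount; start += step)
                        slices.add(new long[] { rowToken, start, Math.min(start + step, columnCount) });
                    return slices;
                }
            }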

          In the meantime, if you can shard your huge rows at the app level instead, that will work better.

          Jonathan Ellis made changes -
          Fix Version/s 2.0 [ 12322954 ]
          Gavin made changes -
          Workflow no-reopen-closed, patch-avail [ 12637883 ] patch-available, re-open possible [ 12749701 ]
          Gavin made changes -
          Workflow patch-available, re-open possible [ 12749701 ] reopen-resolved, no closed status, patch-avail, testing [ 12757188 ]
          Brandon Williams made changes -
          Link This issue is duplicated by CASSANDRA-5419 [ CASSANDRA-5419 ]
          Jonathan Ellis made changes -
          Fix Version/s 2.1 [ 12324159 ]
          Fix Version/s 2.0 [ 12322954 ]
          Sylvain Lebresne made changes -
          Fix Version/s 2.1 beta2 [ 12326276 ]
          Fix Version/s 2.1 [ 12324159 ]
          Jonathan Ellis made changes -
          Status Open [ 1 ] Resolved [ 5 ]
          Fix Version/s 2.1 beta2 [ 12326276 ]
          Resolution Won't Fix [ 2 ]
          Jonathan Ellis added a comment -

          I don't think this will be necessary with CASSANDRA-5351 completed.

          Aleksey Yeschenko made changes -
          Link This issue is related to CASSANDRA-8911 [ CASSANDRA-8911 ]
          Transition         Time In Source Status   Execution Times   Last Executer    Last Execution Date
          Open → Resolved    880d 22h 33m            1                 Jonathan Ellis   13/Mar/14 21:14

            People

            • Assignee:
              Unassigned
            • Reporter:
              Jonathan Ellis
            • Votes:
              0
            • Watchers:
              13

              Dates

              • Created:
                Updated:
                Resolved:
