Uploaded image for project: 'Cassandra'
  1. Cassandra
  2. CASSANDRA-6506

counters++ split counter context shards into separate cells

    Details

    • Type: Improvement
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Fix Version/s: 4.x
    • Component/s: None
    • Labels:
      None

      Description

      This change is related to, but somewhat orthogonal to CASSANDRA-6504.

      Currently all the shard tuples for a given counter cell are packed, in sorted order, in one binary blob. Thus reconciling N counter cells requires allocating a new byte buffer capable of holding the union of the two context's shards N-1 times.

      For writes, in post CASSANDRA-6504 world, it also means reading more data than we have to (the complete context, when all we need is the local node's global shard).

      Splitting the context into separate cells, one cell per shard, will help to improve this. We did a similar thing with super columns for CASSANDRA-3237. Incidentally, doing this split is now possible thanks to CASSANDRA-3237.

      Doing this would also simplify counter reconciliation logic. Getting rid of old contexts altogether can be done trivially with upgradesstables.

      In fact, we should be able to put the logical clock into the cell's timestamp, and use regular Cell-s and regular Cell reconcile() logic for the shards, especially once we get rid of the local/remote shards some time in the future (until then we still have to differentiate between global/remote/local shards and their priority rules).

        Issue Links

          Activity

          Hide
          michaelsembwever mck added a comment -

          Bumping to fix version 4.x, as 3.11.0 is a bug-fix only release.
            ref https://s.apache.org/EHBy

          Show
          michaelsembwever mck added a comment - Bumping to fix version 4.x, as 3.11.0 is a bug-fix only release.   ref https://s.apache.org/EHBy
          Hide
          iamaleksey Aleksey Yeschenko added a comment -

          abel lin this has been explored before. See the comments for CASSANDRA-4775 (TL;DR - it's not going to happen, for multiple serious reasons).

          Show
          iamaleksey Aleksey Yeschenko added a comment - abel lin this has been explored before. See the comments for CASSANDRA-4775 (TL;DR - it's not going to happen, for multiple serious reasons).
          Hide
          aabbeell abel lin added a comment -

          Hi all,

          I am not sure if my question is proper here. But my question is about idempotent counters.
          After reading some threads about counters, I found the idea of replay id seems to solve the
          problem of over-count. But I don't see more discussion about this. If we can use the replay id
          along with timestamp, the counter updated among replicas can be done well even if request again.
          As for extra informations to save the replay id and timestamp, I think it's not a problem.
          We can preserve these for a period of time. The time can be reconfigured.
          These info can be deleted if over the preserved time.

          I am concerned about the counter could be idempotent and how to deal with this in the future.

          Much appreciated.

          Show
          aabbeell abel lin added a comment - Hi all, I am not sure if my question is proper here. But my question is about idempotent counters. After reading some threads about counters, I found the idea of replay id seems to solve the problem of over-count. But I don't see more discussion about this. If we can use the replay id along with timestamp, the counter updated among replicas can be done well even if request again. As for extra informations to save the replay id and timestamp, I think it's not a problem. We can preserve these for a period of time. The time can be reconfigured. These info can be deleted if over the preserved time. I am concerned about the counter could be idempotent and how to deal with this in the future. Much appreciated.
          Hide
          iamaleksey Aleksey Yeschenko added a comment -

          Anyway, committed the first two, changed fixver to 3.0.

          Show
          iamaleksey Aleksey Yeschenko added a comment - Anyway, committed the first two, changed fixver to 3.0.
          Hide
          iamaleksey Aleksey Yeschenko added a comment -

          That's one of the differences between 2.0 and 2.1 - in 2.0 we localCopy() first, then reconcile(), and thus have to use the same allocator all the way, but in 2.1 we reconcile() first, and localCopy() the result.

          Show
          iamaleksey Aleksey Yeschenko added a comment - That's one of the differences between 2.0 and 2.1 - in 2.0 we localCopy() first, then reconcile(), and thus have to use the same allocator all the way, but in 2.1 we reconcile() first, and localCopy() the result.
          Hide
          iamaleksey Aleksey Yeschenko added a comment -

          Yeah, those. We don't need to pass it, because we don't really want to use non-HeapAllocator for merging counter contexts - these objects are extremely short-lived. The important bit is the localCopy() allocator, and there we do use the configured memtable allocator.

          Show
          iamaleksey Aleksey Yeschenko added a comment - Yeah, those. We don't need to pass it, because we don't really want to use non-HeapAllocator for merging counter contexts - these objects are extremely short-lived. The important bit is the localCopy() allocator, and there we do use the configured memtable allocator.
          Hide
          slebresne Sylvain Lebresne added a comment -

          Are we talking about "6506: Clean up CFMetaData" and "6506: Clean up Cell/OnDiskAtom"? Asking because I can't seem to match the commit hash of your comment to the 2nd one in particular (nor git was able to find said commit hash on your branch from above). Anyway, if that's the ones, definitively no objections on the first one. On the second one, shouldn't we continue passing the allocator down to CounterContext up until we do this properly? (but I'm good with the min/maxTimestamp refactoring part).

          Show
          slebresne Sylvain Lebresne added a comment - Are we talking about "6506: Clean up CFMetaData" and "6506: Clean up Cell/OnDiskAtom"? Asking because I can't seem to match the commit hash of your comment to the 2nd one in particular (nor git was able to find said commit hash on your branch from above). Anyway, if that's the ones, definitively no objections on the first one. On the second one, shouldn't we continue passing the allocator down to CounterContext up until we do this properly? (but I'm good with the min/maxTimestamp refactoring part).
          Hide
          iamaleksey Aleksey Yeschenko added a comment -

          Any objections to at least committing the first two cleanup commits now (the first one is there to mostly kill all the IDEA warnings, at last, but the second one - c503d6ae89651186b9ac7fc8026eab0ace137af - does deconfuse the API a bit) ?

          Show
          iamaleksey Aleksey Yeschenko added a comment - Any objections to at least committing the first two cleanup commits now (the first one is there to mostly kill all the IDEA warnings, at last, but the second one - c503d6ae89651186b9ac7fc8026eab0ace137af - does deconfuse the API a bit) ?
          Hide
          slebresne Sylvain Lebresne added a comment -

          With CASSANDRA-6717 this should become a non-issue, but that's 3.0

          Well, that's one more reason why rushing this in 2.1 is possibly not the best option. But truly, I'm not sure we absolutely need CASSANDRA-6717 for this if we really don't want to. It's simple enough to have a boolean flag to distinguish between dense and non-dense in 2.1 if we really want to. Even if that flag get replaced by something else in CASSANDRA-6717, that's still feel cleaner to me that having specific CellNameType implementation, originalType(), etc...

          (see the last set two sets of graphs in CASSANDRA-6553, where 6556 writes are a lot smoother than writes w/out them). That and getting rid of CASSANDRA-6405 for good.

          Fair enough, I'm not saying the end goal is wrong. But tbh, it feels like the patch currently add more complexity than it removes, and I'm really bugged about having special cellName and cellNameType implementations for counters. We also already have done lots of changes to counter in 2.1, I'm just not sure adding another (definitively-not-small) layer of changes on top of that a short time before release is the best strategy to minimize the change of breaking things. Let's say that my gut feeling is that we leave that for 3.0, and use that opportunity to try to simplify that further (CASSANDRA-6717, maybe getting rid of local/remote).

          but I seriously don't see how this could be accomplished

          I think the first step is CASSANDRA-6888. If we have that, then in 3.0 we can at least detect if there is remaining local/remote shard in the system. If there is, one idea could be to ask (force really) people to run some offline upgrade tool. Or run it for them at startup really. I know it's not ideal but well, this is just to say that it's possible, and this may be worth the effort truly.

          hence not holding my breath for implementing counters as maps

          That's not related to the local/remote shards, is it? Seems to me that we could do that even if we still have CounterCell.

          That's because using the timestamp field for the logical clock breaks the re-adding of previously dropped counter cells

          Good point, but that's kind of not a detail imo. But now that we do read-modify-write, we could really use the current time for the timestamp can't we? We'd just have to make sure the times assigned by a local node never go back in time, and add +1 here and there, but that's not too hard. This could also help make writeTime() work with counters, which I believe work currently and is broken by this patch.

          Show
          slebresne Sylvain Lebresne added a comment - With CASSANDRA-6717 this should become a non-issue, but that's 3.0 Well, that's one more reason why rushing this in 2.1 is possibly not the best option. But truly, I'm not sure we absolutely need CASSANDRA-6717 for this if we really don't want to. It's simple enough to have a boolean flag to distinguish between dense and non-dense in 2.1 if we really want to. Even if that flag get replaced by something else in CASSANDRA-6717 , that's still feel cleaner to me that having specific CellNameType implementation, originalType(), etc... (see the last set two sets of graphs in CASSANDRA-6553 , where 6556 writes are a lot smoother than writes w/out them). That and getting rid of CASSANDRA-6405 for good. Fair enough, I'm not saying the end goal is wrong. But tbh, it feels like the patch currently add more complexity than it removes, and I'm really bugged about having special cellName and cellNameType implementations for counters. We also already have done lots of changes to counter in 2.1, I'm just not sure adding another (definitively-not-small) layer of changes on top of that a short time before release is the best strategy to minimize the change of breaking things. Let's say that my gut feeling is that we leave that for 3.0, and use that opportunity to try to simplify that further ( CASSANDRA-6717 , maybe getting rid of local/remote). but I seriously don't see how this could be accomplished I think the first step is CASSANDRA-6888 . If we have that, then in 3.0 we can at least detect if there is remaining local/remote shard in the system. If there is, one idea could be to ask (force really) people to run some offline upgrade tool. Or run it for them at startup really. I know it's not ideal but well, this is just to say that it's possible, and this may be worth the effort truly. hence not holding my breath for implementing counters as maps That's not related to the local/remote shards, is it? Seems to me that we could do that even if we still have CounterCell. That's because using the timestamp field for the logical clock breaks the re-adding of previously dropped counter cells Good point, but that's kind of not a detail imo. But now that we do read-modify-write, we could really use the current time for the timestamp can't we? We'd just have to make sure the times assigned by a local node never go back in time, and add +1 here and there, but that's not too hard. This could also help make writeTime() work with counters, which I believe work currently and is broken by this patch.
          Hide
          iamaleksey Aleksey Yeschenko added a comment -

          Once we have CASSANDRA-6717, I pinky-promise to get rid of most of the stuff that's bothering you in 3.0 (all new CellName methods, originalType() and enableCounters()). (I can kill the new CellName methods right now, really).

          I honestly have no clue wrt getting rid of local and remote shards, which is why I'm fine with this current approach in general. I would love to, really, to just stick to maps, but I just don't see how we can get rid of those.

          Show
          iamaleksey Aleksey Yeschenko added a comment - Once we have CASSANDRA-6717 , I pinky-promise to get rid of most of the stuff that's bothering you in 3.0 (all new CellName methods, originalType() and enableCounters()). (I can kill the new CellName methods right now, really). I honestly have no clue wrt getting rid of local and remote shards, which is why I'm fine with this current approach in general. I would love to, really, to just stick to maps, but I just don't see how we can get rid of those.
          Hide
          iamaleksey Aleksey Yeschenko added a comment -

          I'm also uncomfortable with originalType() (and enableCounters() to some extend). It feels pretty error prone as you need to think hard about whether you you call that first before calling some other method or not. It feels to me that again, we shouldn't need anything special with CellNameType for counters: thrift/CQL should convert things to/from the "internal" format, but as far as CellName/CellNameType are concerned, we shouldn't need anything special casing at all (I do understand that we probably need to start storing whether the table is dense or not in the system table for thrift to convert things properly, but it's high time we do that anyway).

          Mostly, they are there indeed to be able to distinguish between the original comparator being CompositeType(UTF8Type) and just UTF8Type, for example. With CASSANDRA-6717 this should become a non-issue, but that's 3.0, and we can always get rid of originalType() and enableCounters() later.

          Also, tbh, I'm starting to wonder if 2.1 is such a reasonable target for this at this point. Especially given we can't get rid of CounterCell right now (I think we should seriously consider getting rid of local/remote shard for 3.0) this feels like a lot of changes to push at the last minute, and it doesn't felt like it brings so much to the table that it can't wait 3.0. Especially since in 3.0 it might be achievable to get rid of CounterCell once and for all (and we wouldn't have to care about CQL2).

          Maybe. I wanted this in 2.1 to sweeten the upgrade to new counters by improving performance where we can (see the last set two sets of graphs in CASSANDRA-6553, where 6556 writes are a lot smoother than writes w/out them). That and getting rid of CASSANDRA-6405 for good. Current reconcile code should die, really.

          I think we should seriously consider getting rid of local/remote shard for 3.0

          We should, but I seriously don't see how this could be accomplished - hence not holding my breath for implementing counters as maps, for example.

          I believe disallowing the drop of columns for CQL3 counter tables is a regression. Not sure why the patch does it tbh.

          That's because using the timestamp field for the logical clock breaks the re-adding of previously dropped counter cells.

          Show
          iamaleksey Aleksey Yeschenko added a comment - I'm also uncomfortable with originalType() (and enableCounters() to some extend). It feels pretty error prone as you need to think hard about whether you you call that first before calling some other method or not. It feels to me that again, we shouldn't need anything special with CellNameType for counters: thrift/CQL should convert things to/from the "internal" format, but as far as CellName/CellNameType are concerned, we shouldn't need anything special casing at all (I do understand that we probably need to start storing whether the table is dense or not in the system table for thrift to convert things properly, but it's high time we do that anyway). Mostly, they are there indeed to be able to distinguish between the original comparator being CompositeType(UTF8Type) and just UTF8Type, for example. With CASSANDRA-6717 this should become a non-issue, but that's 3.0, and we can always get rid of originalType() and enableCounters() later. Also, tbh, I'm starting to wonder if 2.1 is such a reasonable target for this at this point. Especially given we can't get rid of CounterCell right now (I think we should seriously consider getting rid of local/remote shard for 3.0) this feels like a lot of changes to push at the last minute, and it doesn't felt like it brings so much to the table that it can't wait 3.0. Especially since in 3.0 it might be achievable to get rid of CounterCell once and for all (and we wouldn't have to care about CQL2). Maybe. I wanted this in 2.1 to sweeten the upgrade to new counters by improving performance where we can (see the last set two sets of graphs in CASSANDRA-6553 , where 6556 writes are a lot smoother than writes w/out them). That and getting rid of CASSANDRA-6405 for good. Current reconcile code should die, really. I think we should seriously consider getting rid of local/remote shard for 3.0 We should, but I seriously don't see how this could be accomplished - hence not holding my breath for implementing counters as maps, for example. I believe disallowing the drop of columns for CQL3 counter tables is a regression. Not sure why the patch does it tbh. That's because using the timestamp field for the logical clock breaks the re-adding of previously dropped counter cells.
          Hide
          slebresne Sylvain Lebresne added a comment -

          To me, the main interest of doing this is to as much of special casing of counters at the storage engine layer as possible, to basically make counters just a bunch of normal cells. In that regard, I'm not convinced it's worth having special cellname implementations: the 3 methods added could easily be static methods of say CounterColumns, thus removing a bunch of code, and making it clear counters are just an encoding on top of the storage engine.

          I'm also uncomfortable with originalType() (and enableCounters() to some extend). It feels pretty error prone as you need to think hard about whether you you call that first before calling some other method or not. It feels to me that again, we shouldn't need anything special with CellNameType for counters: thrift/CQL should convert things to/from the "internal" format, but as far as CellName/CellNameType are concerned, we shouldn't need anything special casing at all (I do understand that we probably need to start storing whether the table is dense or not in the system table for thrift to convert things properly, but it's high time we do that anyway).

          In fact, I'm starting to think that the best way to do this would be to internally deal with counters as (CQL) maps of shard id -> count. This would allow reuse of the existing as often as possible (we wouldn't need modification of CQL3RowOfSparse for instance). One obstacle to that is that collections only work with true CQL3 table, and we need to deal with thrift CF of counters, but honestly, I think I'd be fine encoding those as true CQL3 tables under the hood having just one CQL column (that could have an empty name). With compression and since counters table will be composite under the hood anyway, I'm not sure the overhead would be noticeable and it would definitively simplify matters.

          Also, tbh, I'm starting to wonder if 2.1 is such a reasonable target for this at this point. Especially given we can't get rid of CounterCell right now (I think we should seriously consider getting rid of local/remote shard for 3.0) this feels like a lot of changes to push at the last minute, and it doesn't felt like it brings so much to the table that it can't wait 3.0. Especially since in 3.0 it might be achievable to get rid of CounterCell once and for all (and we wouldn't have to care about CQL2).

          Outside of those general comments, a few minor remarks/nits gathered from reading the patch:

          • isSameCQL3RowAs should use the comparator, not equals().
          • I believe disallowing the drop of columns for CQL3 counter tables is a regression. Not sure why the patch does it tbh.
          • Nit: I'd remove the 'clock' arg to CounterUpdateCell.create() and hardcode it to 1 inside the method to avoid misuse.
          • Nit: I was liking the inital comment of CounterUpdateCell.reconcile, I'd be happy keeping it.
          Show
          slebresne Sylvain Lebresne added a comment - To me, the main interest of doing this is to as much of special casing of counters at the storage engine layer as possible, to basically make counters just a bunch of normal cells. In that regard, I'm not convinced it's worth having special cellname implementations: the 3 methods added could easily be static methods of say CounterColumns, thus removing a bunch of code, and making it clear counters are just an encoding on top of the storage engine. I'm also uncomfortable with originalType() (and enableCounters() to some extend). It feels pretty error prone as you need to think hard about whether you you call that first before calling some other method or not. It feels to me that again, we shouldn't need anything special with CellNameType for counters: thrift/CQL should convert things to/from the "internal" format, but as far as CellName/CellNameType are concerned, we shouldn't need anything special casing at all (I do understand that we probably need to start storing whether the table is dense or not in the system table for thrift to convert things properly, but it's high time we do that anyway). In fact, I'm starting to think that the best way to do this would be to internally deal with counters as (CQL) maps of shard id -> count. This would allow reuse of the existing as often as possible (we wouldn't need modification of CQL3RowOfSparse for instance). One obstacle to that is that collections only work with true CQL3 table, and we need to deal with thrift CF of counters, but honestly, I think I'd be fine encoding those as true CQL3 tables under the hood having just one CQL column (that could have an empty name). With compression and since counters table will be composite under the hood anyway, I'm not sure the overhead would be noticeable and it would definitively simplify matters. Also, tbh, I'm starting to wonder if 2.1 is such a reasonable target for this at this point. Especially given we can't get rid of CounterCell right now (I think we should seriously consider getting rid of local/remote shard for 3.0) this feels like a lot of changes to push at the last minute, and it doesn't felt like it brings so much to the table that it can't wait 3.0. Especially since in 3.0 it might be achievable to get rid of CounterCell once and for all (and we wouldn't have to care about CQL2). Outside of those general comments, a few minor remarks/nits gathered from reading the patch: isSameCQL3RowAs should use the comparator, not equals(). I believe disallowing the drop of columns for CQL3 counter tables is a regression. Not sure why the patch does it tbh. Nit: I'd remove the 'clock' arg to CounterUpdateCell.create() and hardcode it to 1 inside the method to avoid misuse. Nit: I was liking the inital comment of CounterUpdateCell.reconcile, I'd be happy keeping it.
          Hide
          iamaleksey Aleksey Yeschenko added a comment -

          https://github.com/iamaleksey/cassandra/commits/6506

          The branch is missing two things:
          1. Filter conversion between 2.0 and 2.1 nodes (read compatibility) - I'm still finishing it up/testing
          2. sstableimport/sstableexport/ASSTSW support for new counter cells (trivial)

          Extensive tests (unit and dtestes) will go into CASSANDRA-6626. I want this in sooner than that.

          Sylvain Lebresne Can you start reviewing it as is? I'll push the two commits that are left before you are done with the review of these.

          Show
          iamaleksey Aleksey Yeschenko added a comment - https://github.com/iamaleksey/cassandra/commits/6506 The branch is missing two things: 1. Filter conversion between 2.0 and 2.1 nodes (read compatibility) - I'm still finishing it up/testing 2. sstableimport/sstableexport/ASSTSW support for new counter cells (trivial) Extensive tests (unit and dtestes) will go into CASSANDRA-6626 . I want this in sooner than that. Sylvain Lebresne Can you start reviewing it as is? I'll push the two commits that are left before you are done with the review of these.

            People

            • Assignee:
              Unassigned
              Reporter:
              iamaleksey Aleksey Yeschenko
            • Votes:
              2 Vote for this issue
              Watchers:
              24 Start watching this issue

              Dates

              • Created:
                Updated:

                Development