CASSANDRA-8099: Refactor and modernize the storage engine

    Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Fix Version/s: 3.0 alpha 1
    • Component/s: None
    • Labels: None

      Description

      The current storage engine (which for this ticket I'll loosely define as "the code implementing the read/write path") is suffering from old age. One of the main problems is that the only structure it deals with is the cell, which completely ignores the higher-level CQL structure that groups cells into (CQL) rows.

      This leads to many inefficiencies, like the fact that during a read we have to group cells multiple times (to count on the replica, then to count on the coordinator, then to produce the CQL result set) because we forget about the grouping right away each time (so lots of useless cell name comparisons in particular). But beyond inefficiencies, having to manually recreate the CQL structure every time we need it for something is hindering new features and makes the code more complex than it should be.

      Said storage engine also has tons of technical debt. To pick an example, the fact that during range queries we update SliceQueryFilter.count is pretty hacky and error prone. Or the overly complex lengths AbstractQueryPager has to go to simply to "remove the last query result".

      So I want to bite the bullet and modernize this storage engine. I propose to do 2 main things:

      1. Make the storage engine more aware of the CQL structure. In practice, instead of having partitions be a simple iterable map of cells, a partition should be an iterable list of rows (each row being itself composed of per-column cells, though obviously not exactly the same kind of cell we have today).
      2. Make the engine more iterative. What I mean here is that in the read path we end up reading all cells into memory (we put them in a ColumnFamily object), but there is really no reason to. If instead we worked with iterators all the way through, we could get to a point where we're basically transferring data from disk to the network, and we should be able to reduce GC substantially (a rough sketch follows this list).
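
      To illustrate point 2 only: below is a hedged, hypothetical sketch of what an iterator-based read path could look like. The interface names and methods are invented for illustration and do not necessarily match the classes in the patch.

          import java.util.Iterator;

          // Hypothetical sketch only: names are illustrative, not the actual classes in the patch.
          // A partition is consumed as a stream of CQL rows rather than as one big map of cells.
          interface Cell {}                                  // placeholder for a per-column cell
          interface Row
          {
              Iterator<Cell> cells();                        // the per-column cells of one CQL row
          }

          interface RowIterator extends Iterator<Row>, AutoCloseable
          {
              @Override
              void close();                                  // releases sstable/memtable resources, no checked exception
          }

          // A read result is a stream of partitions: data can flow from disk to the
          // network without materializing a whole ColumnFamily object in memory.
          interface PartitionIterator extends Iterator<RowIterator>, AutoCloseable
          {
              @Override
              void close();
          }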

      Please note that such a refactor should provide some performance improvements right off the bat, but that's not its primary goal either. Its primary goal is to simplify the storage engine and add abstractions that are better suited to further optimizations.

      Attachments: 8099-nit (8 kB) - Benedict

          Activity

          slebresne Sylvain Lebresne added a comment -

          For information, and to hopefully clarify a bit more what I'd like to do here, I've pushed my WIP branch for this here. It's nowhere near complete, nothing is definitive and it's not even close to compiling. But it should hopefully have enough meat already to give a sense of the direction this is taking.

          jbellis Jonathan Ellis added a comment -

          /cc Jason Brown Pavel Yaskevich Aleksey Yeschenko Marcus Eriksson sankalp kohli T Jake Luciani

          xedin Pavel Yaskevich added a comment -


          No-no-no, I think we should go all the way and re-write everything; that gives us all the benefits in a single shot, plus who cares about supporting all the features anyway, right?

          Seriously tho, this sounds like a good first step.

          jasobrown Jason Brown added a comment -

          I'm +1 on the idea. I poked through the code quickly, and it seemed to be headed in the right direction - although I'd have to read more carefully/think more wrt some of my earlier thoughts about (pluggable) storage engines. Also, I see that 'Column' has made a comeback.

          iamaleksey Aleksey Yeschenko added a comment -

          As I've said before, yeah, we should do this. Preferably spreading the whole thing over the 3.x series though, limiting the amount of per-release changes, wherever it makes sense.

          jbellis Jonathan Ellis added a comment -

          How is this looking two weeks later? Any potential blockers come up?

          slebresne Sylvain Lebresne added a comment -

          How is this looking two weeks later?

          Prettier but not really near completion I'm afraid.

          Any potential blockers come up?

          Not really blockers, but I'll admit that this is bigger/longer than I thought.

          slebresne Sylvain Lebresne added a comment -

          As an update on this, I've (force) pushed to the branch.

          This is still not done, but it compiles, runs and works for basic queries. That said, at least reverse, distinct and 2ndary index queries (and a bunch of other smaller details) are known not to work (but that should change relatively soon). Also, I'll still need to work on thrift support and general backward compatibility for rolling upgrades. Nonetheless, the general design is here and shouldn't change drastically. It's also barely tested at this point, and I need to add more comments.

          This is a huge patch. And while this doesn't change how Cassandra works in any fundamental way, this can't really be called a small incremental change. I'm the first to admit that those are not good things in theory, but to be honest, as this tries to clean up the abstractions that are at the core of the storage engine, I don't see how this could be done a whole lot more incrementally. I am convinced that this will make implementing tons of outstanding features a lot easier and is worth it just for that. Anyway, that's my (lame) excuse for this. Also, it's in a single big commit because I had a lot of back and forth while implementing, so that my wip commits were more misleading than anything. At a later point, if that's deemed better for review, I could try to split it into smaller commits of somewhat-related code (though I'm not entirely sure it will help tremendously).

          benedict Benedict added a comment -

          A small thing I noticed, something that's bugged me before and that this nicely helps tidy up (and should be cleaned up as part of this, IMO): all of the sub-comparators in ClusteringComparator can be removed, and the class itself can simply implement Comparator<Clusterable>, since all of the things that can be compared now implement Clusterable (perhaps one sub-comparator could be kept if directly comparing ClusteringPrefix is very common in high-traffic code paths, but it's probably not necessary).

          .rowComparator can be removed already and .atomComparator can be removed by only making MergeIterator accept <? super In> instead of <In>.

          clusterableComparator would then be subsumed by the enclosing class; everywhere it's referred to you can simply delete ".clusteringComparator" and it will work.

          I'm sure it's out of date now, but I've posted the diff anyway. The summary is:

           src/java/org/apache/cassandra/db/ClusteringComparator.java            | 33 ++++++---------------------------
           src/java/org/apache/cassandra/db/atoms/AtomIterators.java             |  2 +-
           src/java/org/apache/cassandra/db/filters/NamesPartitionFilter.java    |  2 +-
           src/java/org/apache/cassandra/db/partitions/AtomicBTreePartition.java |  4 ++--
           src/java/org/apache/cassandra/utils/MergeIterator.java                | 10 +++++-----
           5 files changed, 15 insertions(+), 36 deletions(-)
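
          For illustration only, here is a hedged sketch of the shape this suggestion describes; the field and type names are simplified stand-ins rather than the actual code in the branch.

              import java.util.Comparator;
              import java.util.Iterator;
              import java.util.List;

              // Sketch only: the comparator compares anything Clusterable directly, so the
              // separate rowComparator/atomComparator/clusterableComparator fields go away.
              interface Clusterable { /* exposes its clustering prefix */ }

              class ClusteringComparator implements Comparator<Clusterable>
              {
                  @Override
                  public int compare(Clusterable a, Clusterable b)
                  {
                      return 0; // compare the clustering prefixes component by component (elided)
                  }
              }

              // With a bounded wildcard, a single Comparator<Clusterable> serves any element
              // type that implements Clusterable, e.g. a merge iterator over rows or atoms:
              class MergeIterator<In extends Clusterable>
              {
                  MergeIterator(List<Iterator<In>> sources, Comparator<? super In> comparator) { /* merge logic elided */ }
              }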
          
          benedict Benedict added a comment -

          Also, I'm sure you're on top of this already, but this could do with a lot more commenting. Little things like what is meant by a "complex" field: with a bit of digging you can determine this is related to collections, but you have to jump to declarations quite a few times before establishing this. Simple descriptions of classes like "ComplexCellBlock", and of what a class is used for, as with "CellWriter", would help.

          It would really help to reduce the mystery that has surrounded some of the internal functionality. Right now I don't think anybody is fully conversant with even the current (post CASSANDRA-5417) functionality besides yourself. I know this rewrite will help a lot, but I doubt it will fix the problem entirely. There's a lot of legacy cruft to deal with still, that makes things complicated, and this rewrite will put almost everyone back to square one.

          slebresne Sylvain Lebresne added a comment -

          all of the sub-comparators in ClusteringComparator can be removed, and the class itself can simply implement Comparator<Clusterable>

          I'll incorporate that, thanks. I wanted to do it but I still had some crappy dependencies in the initial versions and didn't look back at it after that was cleaned up.

          this could do with a lot more commenting.

          I couldn't agree more and plan on addressing that rather soon. My bad excuse for not doing it right away is that I had a lot of back and forth on the exact abstractions to make everything fit, and didn't want to lose time commenting stuff that might go away the next day. But I have a "go over it all and comment the shit out of it" task on my todo list.

          benedict Benedict added a comment -

          I think it's a shame this patch wasn't attempted at least a little more incrementally. It looks to me that changing the serialization formats, memtables and iterator implementations could have been done in different patches at least, and it might have made review safer and easier, and given us more time to digest the changes. I understand that this might have introduced some extra development burden, although that might have been reaped back in many fewer days spent rebasing. Having an initial patch to change the abstractions I think would have helped to reduce the burden on the rest of the project, and perhaps helped parallelize the work as well.

          I'm a little concerned that the result is that we won't give each of the pretty major decisions that have been made the time they need to be assessed properly, especially now we're ramping up for release (and hence low on time). I'm not necessarily suggesting we split it, as I can imagine that would be soul crushingly unpleasant for Sylvain Lebresne and introduce a delay, but I am generally ill at ease with the scope of the changes and our ability to vet them. I'm also worried I'm finding myself saying "too close to release to question this decision" - which seems a problematic mode to be merging any patch under.

          slebresne Sylvain Lebresne added a comment -

          I think it's a shame this patch wasn't attempted at least a little more incrementally.

          I certainly understand the criticism, this is definitely not as incremental as it should be. My lame "defence" is that since this structurally changes the main abstraction used by the storage engine, it quickly trickles down to everything else, so that I just wasn't sure how to attack this more incrementally in practice. For the serialization formats, I could indeed have stuck to serializing to the old format, but given the mismatch between the old format and the new abstractions, it was actually simpler to just write in a meaningful format right away (it allowed me to get something working faster). And since the new serialization format details are fairly well encapsulated (mostly in AtomSerializer.java), I'll admit it didn't feel like a huge deal overall. But in any case, I probably haven't tried hard enough and/or I'm not smart enough to have figured out how to make that happen more incrementally, and for that, I apologize.

          I'm also worried I'm finding myself saying "too close to release to question this decision"

          I agree that not questioning a decision that you think is worth questioning should be avoided, but I also don't think that this needs to be the case. If you think a decision makes things worse than they are in current trunk, then by all means, let's bring it up. If there are enough such concerns voiced to make us think this patch won't be a net improvement over the status quo, and there is no time to address those concerns, then I'll be the first to suggest that, as sad as that would make me, we should consider pushing it after 3.0 (but I do have the weakness of thinking that the patch is a net improvement).

          Now, I don't pretend that every choice made here is absolutely optimal (I'm afraid I'm not that smart), so there will be things that can be improved (and maybe some will require subsequent changes). But as long as something doesn't make things worse than they currently are, I'd suggest it's probably ok to just create tickets for those improvements. After all, this isn't meant at all to be the definitive version of the Cassandra code, it just aims to be cleaner ground to improve upon than we currently have.

          Don't get me wrong, I'm not trying to say that such a big patch is ideal, it's not. I just didn't figure out how to do better.

          slebresne Sylvain Lebresne added a comment -

          I've just rebased and (force) pushed the current version of the patch to the usual branch. It still doesn't handle thrift and misses backward compatibility code for the internal messages (and I'll start working on those), but it's basically complete otherwise. In particular, it passes all the CQL tests (unit and dtests) we have. It also seems to be passing other dtests (that don't use thrift), but I'll admit I haven't had the patience to run them all locally and jenkins seems to be in a bad mood recently, so a couple might require attention, but that's likely minor. Also, I haven't taken the time to upgrade most of our unit tests and this will be done next with the help of some others, but hopefully the CQL tests and dtests exercise the changed code enough that there shouldn't be major surprises.

          Overall the missing parts are sufficiently isolated that I think initial review can be started. I've actually written here some kind of overview/guide for the sake of making diving into the patch easier. I'll be happy to update it if there is something missing that would help.

          benedict Benedict added a comment -

          Like I said, it's a shame; I lament not having longer to criticise the less optimal decisions. That's not at all to suggest they will cumulatively sabotage this patch to worse than the status quo. But the bar for improvement is much higher once a round of changes goes in (not least because of the effort of maintaining compatibility each time, but also because it has to be justified afresh, and be worth the risk, argumentation, redevelopment, etc.), and so we will find ourselves settling more readily than had we considered our options more carefully up front, especially when there are so many aspects to discuss. I don't think there is much to be done about it now, though, given the time constraints, and we will simply have to do our best.

          Anyway, I'll try to properly digest the patch over the next week or so, so I can give some actual concrete feedback. On the whole I do think it is a huge step forward (well, perhaps not the naming). I just wish we weren't rushing this part after waiting so long for it, and that we had at least discussed some of the more concrete aspects of the design in advance.

          The concern I have about the scope being too large to vet effectively is somewhat uncorrelated, but I don't have a good answer for that either. My experience is that review's capacity for finding problems doesn't scale linearly with the scope and complexity of a patch, and I don't think we've ever had a patch as large as this (it's basically a whole version jump on its own). Of course, if you're planning to break 3.0 just to make me feel better about breaking 2.1, I'm cool with that.

          snazy Robert Stupp added a comment -

          As you say in the doc, the naming of atom is really bad and should be changed IMO. Some proposals:

          • Cluster - it is basically that - but cluster is an occupied term
          • Line (slightly similar to row)
          • Assembly
          • or maybe just RawRow

          NamesPartitionFilter - not sure whether names is a good word here. Propose ClusteringPartitionFilter or ClusteredPartitionFilter

          Good idea to make CachePartition an interface!

          BTW: interesting to see that the term Doppelgänger is known in English

          jbellis Jonathan Ellis added a comment -

          I'd like Aleksey to be "chief reviewer" with input from (at least) Benedict and Benjamin.

          tjake T Jake Luciani added a comment -

          For the purposes of bug fixing and testing, we "reviewers/helpers" should submit pull requests to Sylvain's github branch for inclusion, with the final merged version of his branch being committed.

          blerer Benjamin Lerer added a comment -

          As the patch is relatively large, I have chosen to split my review of the CQL layer into chunks and give my comments for each chunk as soon as I have finished reviewing it. I think it will make things more manageable for Sylvain and me.

          For the first chunk I focused on the restrictions:

          • I am not a big fan of big class hierarchies, but I wonder if it would not be better to have two sub-classes of PrimaryKeyRestrictionSet, one for the partition key and one for the clustering columns, rather than having a boolean variable.
          • In PrimaryKeyRestrictionSet the method addColumnFilterTo can be simplified based on the fact that we know if the restrictions are on the partition key components or on the clustering key columns.
          • The AbstractPrimaryKeyRestrictions.toByteBuffers method can be moved down as it is only used in PrimaryKeyRestrictionSet
          • In MultiColumnRestriction the method isPartitionKey() is not used (in case you have forgotten: MultiColumnRestriction only applies to clustering key columns).
          • I understand why you renamed ?Restriction.Slice to ?Restriction.SliceRestriction, but now the class names look a bit inconsistent. Maybe we should rename the other classes too.
          • In ColumnFilter the add(Expression expression) method is not used.
          • In Operator the reverse method is not needed anymore and can be removed.
          • In StatementRestrictions I do not understand the use of useFiltering. My understanding was that we should return an error message specifying that ALLOW FILTERING is required and that this problem should have been handled by checkNeedsFiltering in SelectStatement. Could you explain?
          • In StatementRestrictions the nonPKRestrictedColumns method looks wrong to me as it can return some primary key columns.
          slebresne Sylvain Lebresne added a comment -

          Thanks for some initial review. I've pushed a commit for most of the remarks. I answer the rest below:

          I wonder if it will not be better to have two sub-classes for PrimaryKeyRestrictionSet one for the partition key and one for the clustering columns rather than having a boolean variable

          I don't disagree, but I think we should generally clean up the handling of partition keys so it doesn't "pretend" to be clustering columns (which will imply separating into 2 classes). And so I'd like to do that properly as a followup, since it's not essential to this ticket (and I'm sure we can agree it's big enough as is).

          In StatementRestrictions I do not understand the use of useFiltering. My understanding was that we should return an error message specifying that ALLOW FILTERING is required and that this problem should have been handled by checkNeedsFiltering in SelectStatement.

          If you have a restriction on an indexed column but that restriction is not "queriable" (not an equality), we actually always reject the query (as in, even with ALLOW FILTERING) with an error message that says we can't find a usable 2ndary index. I'm not saying this is a good thing, it's really just historical and we should fix it, but this ticket is arguably not the right place for this (CASSANDRA-4987 would typically be a better place for that).

          We also don't even call needsFiltering if the query is not a range one (because we don't support ALLOW FILTERING there yet, which is CASSANDRA-6377), but we should still reject the queries described above (restriction on an indexed column but not one usable by the index) for single partition queries.

          Another way to put it is that the added validation is just the validation that is done on trunk in SecondaryIndexManager.validateIndexSearchersForQuery (and so was not handled by checkNeedsFiltering) which I moved in StatementRestrictions because that was convenient for the patch. TL;DR, we should clean all this in follow-ups, but I'd rather keep it simple for this ticket.

          snazy Robert Stupp added a comment -

          Just found some naming issues and some nits.

          Altogether I have to say that CASSANDRA-8099 is a great step forward! It simplifies a lot of areas in the code, explains a lot of things in javadoc and makes it a lot easier to follow the travelled code path using (mostly) appropriate nomenclature.

          NITs

          • org.apache.cassandra.db.ReadQuery can be an interface (only abstract methods) and is mentioned as an interface in the javadoc
          • org.apache.cassandra.config.CFMetaData#columnMetadata can be final
          • org.apache.cassandra.config.CFMetaData#getDefaultIndexName, #isNameValid, #isIndexNameValid use a non-compiled regexp (a sketch of the precompiled alternative follows this list)
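
          To illustrate the non-compiled regexp nit, here is a hedged sketch of the precompiled alternative; it is not the actual CFMetaData code, and the pattern string is made up.

              import java.util.regex.Pattern;

              class NameValidation
              {
                  // Compile the pattern once instead of recompiling it on every call
                  // (which is what a per-call String.matches(...) or Pattern.matches(...) does).
                  // The pattern below is illustrative only.
                  private static final Pattern VALID_NAME = Pattern.compile("\\w{1,48}");

                  static boolean isNameValid(String name)
                  {
                      return name != null && VALID_NAME.matcher(name).matches();
                  }
              }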

          I didn’t create a pull-req since all findings above are just nits and could only make final rebasing harder.

          Nomenclature

          The name ColumnFilter is a bit misleading. From the first impression I thought it's a filter that filters CQL columns - but it's used to do a 2i lookup.

          Can you rename NamesPartitionFilter to something with clustering key in it? I know that the term name is used elsewhere for clustering key.

          CBuilder/MultiCBuilder could be more expressive as ClusteringBuilder/MultipleClusteringsBuilder

          Misc

          At first I ran into a situation where cluster_name and host_id were null in system.local, but I had no luck reproducing it (I'm sure I did an ant realclean jar and rm -rf data/* before). So just take this as a note - not something worth discussing.

          I did a quick&dirty prototype of CASSANDRA-7396 based on 8099 and it looks much easier (without the slicing stuff).

          blerer Benjamin Lerer added a comment -

          Here is my second chunk of feedback on the CQL layer:

          • In Selection the method containsPartitionKeyColumns is not used.
          • The class ReadQuery does not contain any implementation, so it should probably be an interface.
          • PartitionFilter, NamesPartitionFilter and SlicePartitionFilter contain some commented-out code that should probably be removed.
          • In NamesPartitionFilter and SlicePartitionFilter the toCQLString methods contain some duplicate code for the ORDER BY that can be extracted and put in AbstractPartitionFilter.
          • In SelectStatement:
            • in the gatherQueriedColumns method the PartitionColumns.Builder.addAll method could be used to simplify the code
            • in the process(DataIterator) method the options variable could be inlined
            • the method getSliceCommands should return a ReadQuery as it will help to simplify the getQuery method
            • nowInSec should be generated in getQuery instead of being provided as argument
            • the logger variable is not used
            • I wonder if it would not be cleaner to have an empty ReadQuery than to deal with null
          • In BatchStatement conditionColumns does not seem to be read
          • In QueryPager:
            • the pager(ReadQuery, ConsistencyLevel, ClientState) method is not used
            • there is some commented code with a TODO which provides no information
            • instead of using instanceof, the code of pager(ReadQuery, ConsistencyLevel, ClientState, boolean, PagingState) should probably use polymorphism (see the sketch after this list)
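
          To illustrate the instanceof-vs-polymorphism point, here is a hedged sketch with invented type names; it is not the actual QueryPager code.

              // Sketch only: hypothetical names standing in for the real read commands and pagers.
              interface QueryPager {}
              class PagingState {}

              interface ReadQuery
              {
                  // Each concrete query type builds its own pager, so callers never
                  // need an instanceof chain to pick the right pager implementation.
                  QueryPager getPager(PagingState state);
              }

              class SinglePartitionReadQuery implements ReadQuery
              {
                  public QueryPager getPager(PagingState state) { return new QueryPager() {}; /* single-partition pager */ }
              }

              class PartitionRangeReadQuery implements ReadQuery
              {
                  public QueryPager getPager(PagingState state) { return new QueryPager() {}; /* range pager */ }
              }
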
          benedict Benedict added a comment -

          So I'm still trying to digest the entirety of the patch (amongst other things), and have been reticent to give feedback in advance of this. Since being fully conversant is likely a way off, I figure I should highlight the biggest (labour-wise) concern I have sooner rather than later.

          Flyweights. They complicate the code, introducing a lot of extra implementations of classes. This is both a cognitive burden and an issue for the optimizer. As an example, Clustering looks like it could likely get away with just a single implementation, or perhaps two, as opposed to the current eight. Since these classes are accessed everywhere, and often, having efficient method despatch is important. But also classes like Sorting would be unnecessary, and we could depend on Java sorting, and things like RowDataBlock would not need such complexity for overloading of behaviour to support both rows and collections of rows.

          The upside of the flyweights AFAICT is very slim. There is only a tiny amount of temporary heap space saved, since the vast majority of the data (the ByteBuffer objects) is still floating around for every value. I would be surprised if we measurably reduced GC burden, and in fact, since their lifetimes are often longer, they may be promoted with higher likelihood. Of course, I may be missing some major beneficial case that is important, but either way I figure we should start a discussion, sooner rather than later, on whether or not it is worth retaining them in this patch. It is possible that the abstraction may be useful in future, but I don't think it is right now, and I would prefer to introduce it if and when we're confident it will be of use, and to benchmark it independently of these other complex changes.
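
          As a hedged illustration of the trade-off being discussed (invented names, not the patch's actual classes): a flyweight-style reusable accessor versus a plain immutable value class.

              import java.nio.ByteBuffer;

              // Flyweight style: one mutable "view" is repositioned over backing storage,
              // saving some short-lived allocations but multiplying implementations and
              // making call sites harder to reason about.
              class ClusteringView
              {
                  private ByteBuffer[] values;
                  private int offset;
                  private int size;

                  ClusteringView setTo(ByteBuffer[] values, int offset, int size)
                  {
                      this.values = values; this.offset = offset; this.size = size;
                      return this;
                  }

                  ByteBuffer get(int i) { return values[offset + i]; }
                  int size() { return size; }
              }

              // Plain immutable value: one small allocation per clustering, but a single,
              // simple implementation with straightforward method dispatch.
              final class ImmutableClustering
              {
                  private final ByteBuffer[] values;
                  ImmutableClustering(ByteBuffer[] values) { this.values = values.clone(); }
                  ByteBuffer get(int i) { return values[i]; }
                  int size() { return values.length; }
              }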

          slebresne Sylvain Lebresne added a comment -

          Flyweights.

          I kind of agree. I went with them because there had been a desire to use them for a while on CASSANDRA-5019 and it felt like using them from day one might not be that hard. And I kind of went with it because I wanted to finish the APIs and the code that uses them before thinking too much about the more internal implementations.

          But you're right, it does make some things more complex than they have to be, and the gains are unclear at best (as in, it might hurt more than it helps). Hopefully though, the APIs don't really depend on them and so changing this shouldn't be too hard. I want to finish the last outstanding parts: thrift (which is coming along) and reading old formats; but then I'll have a look at removing the flyweights, at least the main ones, and simplifying accordingly.

          benedict Benedict added a comment -

          FTR, I plan to leave review of the Partition hierarchy (in particular the memtable stuff) until after that refactor, but am attempting to tackle the rest of the code in the meantime.

          As I do this, I will arrange pull requests (as I have done just now). I will only post updates here if the pull request is in any way substantial, so that they can be discussed on the permanent record. I think this will be more efficient than my posting suggestions/criticisms, since you only have so much time on your hands, and this permits a degree of parallelization that may help us reach a better end state in the time available.

          tjake T Jake Luciani added a comment -

          After some time spent with the new API, my biggest concern is that using OpOrder in iterators is unsafe. To address this we talked about adding code analysis to find improper use of auto-closeable, but ideally we should replace this with something safer, like ref counting the op order so it will get cleaned up, worst case, by the leak detector.

          benedict Benedict added a comment -

          my biggest concern is using OpOrder in iterators is unsafe

          I agree with this being dangerous, but I would prefer not to rush into using Ref; it isn't intended for use on a critical path. We at least need to address CASSANDRA-9379, but even then I would prefer to only use a Ref in situations where the iterator will go out of scope, i.e. where we cannot use it in a try/finally block. In this case, we can just explicitly wrap it in a Ref object, and always access it via the Ref, so it's clear we're being safe.

          We should perhaps try and rustle up some static analysis to tell us where we do not use a try/finally, and fail the compile if we don't use a Ref there. If this is too difficult then I am with T Jake Luciani and would prefer to err on the side of caution, by introducing CASSANDRA-9379 and always wrapping the OpOrder in a Ref.
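
          A hedged sketch of the kind of usage being discussed; the iterator and group types are simplified stand-ins invented for illustration, not the actual Cassandra classes.

              import java.util.Iterator;
              import java.util.NoSuchElementException;

              // Sketch only: an iterator owning a resource (standing in for an OpOrder.Group),
              // used in a try-with-resources block so the resource is always released.
              interface OpGroup extends AutoCloseable { @Override void close(); }

              class GuardedRowIterator implements Iterator<Object>, AutoCloseable
              {
                  private final OpGroup group;
                  GuardedRowIterator(OpGroup group) { this.group = group; }

                  public boolean hasNext() { return false; }                       // iteration elided
                  public Object next() { throw new NoSuchElementException(); }

                  @Override
                  public void close() { group.close(); }                           // guarantees the group is released
              }

              class Usage
              {
                  static void consume(OpGroup group)
                  {
                      // If the iterator must escape this scope instead, wrapping it in a
                      // ref-counted holder (the Ref idea above) would be the safer option.
                      try (GuardedRowIterator iter = new GuardedRowIterator(group))
                      {
                          while (iter.hasNext())
                              iter.next();
                      }
                  }
              }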

          benedict Benedict added a comment -

          It looks like this is easier than expected:

          Resource not managed via try-with-resource (1.7 or higher)
          When enabled, the compiler will issue an error or a warning if a local variable holds a value of type 'java.lang.AutoCloseable', and if the method 'close()' is explicitly invoked on that resource, but the resource is not managed by a try-with-resources block.
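
          For illustration, a minimal example of the pattern this compiler setting would flag, and the form it accepts; the file name is made up.

              import java.io.FileInputStream;
              import java.io.IOException;

              class Example
              {
                  // Flagged: close() is invoked explicitly, but the resource is not managed
                  // by a try-with-resources block, so an exception between open and close
                  // would leak it.
                  static void flagged() throws IOException
                  {
                      FileInputStream in = new FileInputStream("data.db");   // made-up file name
                      in.read();
                      in.close();
                  }

                  // Accepted: the same resource managed by try-with-resources.
                  static void accepted() throws IOException
                  {
                      try (FileInputStream in = new FileInputStream("data.db"))
                      {
                          in.read();
                      }
                  }
              }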

          tjake T Jake Luciani added a comment - edited

          Hmm, is this eclipse only?

          benedict Benedict added a comment -

          Yeah. It would mean setting up ecj as another target, unfortunately, but that is probably worth the effort.

          tjake T Jake Luciani added a comment -

          I see. I think Robert Stupp would know how to do this from CASSANDRA-8241?

          snazy Robert Stupp added a comment -

          Are you thinking of using ecj in build.xml or in IDEA?

          benedict Benedict added a comment -

          I've filed CASSANDRA-9431 to continue this discussion and avoid polluting this ticket.

          blambov Branimir Lambov added a comment -

          Issues I am seeing in the changes in compaction (org.apache.cassandra.db.compaction package), in order of significance:

          • Side effects of iteration:
            • CompactionIterable 125: I doubt index update belongs here, as a side effect of iteration. Ideally the index should be collected, not updated. Presumably mergedCell is already being added by the writer; perhaps cleaning other entries there is a better way to go.
            • CompactionIterable 237: Another side effect of iteration that should preferably be handled by the writer.
            • Validation compaction now uses CompactionIterable and thus has side effects (index & cache removal).
          • Iterator closing:
            • CompactionManager 900: Several issues:
              • partition can be used after it is closed.
              • partition is not always closed: this closes it, but 895 and 870 don't.
              • Why is the iterator closed here rather than in the scope that opened it?
              • Please use explicit close of partition. This is an abuse of the try syntax (not obvious behavior, hidden from view).
            • CompactionManager 1052, 1160: Partition is not closed.
          • Stuff that could be confusing:
            • In guide_8099, description of RangeTombstoneMarkers: add that there is never content between two corresponding tombstone markers on any iterator. PurgingPartitionIterator for one relies on this.
            • CompactionIterable 83: Are we interested in highest-index nowInSec() only? Not max/min? Could you add an explanation in comment if that's the case?
            • CompactionIterable 47: AtomicInteger is unnecessary; remove it, as it could be understood to imply thread-safety where there is none.
            • [Abstract]CompactionIterable are misnamed. They are now iterators.
          • Nits:
            • CompactionIterable 48: format is never used.
            • CompactionIterable 100: column is not used.
            • CompactionManager 785: Revert Collections.singleton(sstable) to sstableSet.
            • CompactionManager 943: Revert removal of break.
            • Imports could be cleaned.

          I haven't examined correctness/equivalence of (Partition|Atom)Iterators.merge() to older code yet.

          Looking at 8099 overall, I don't know if that's a discussion you have had before and I haven't participated in, but I have strong objections to using iterators as the representations of tables and partitions in the abstraction. This appears to be carried over from the old code, but they now gain additional content and behaviour which make them even less suited to the role. These objects should be Iterable instead. Having that would give clear separation between the iteration process and the entity-level data and behaviour which is currently a cause of multiple problems, including the top two categories above.
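          As a rough illustration of what I mean, here is a minimal sketch of an Iterable-shaped partition (the interface and type parameters are hypothetical, not actual 8099 classes): entity-level data lives on the entity, while each iterator() call creates fresh, independent iteration state.

          import java.util.Iterator;

          // Hypothetical shape only: K = partition key type, D = deletion info, R = row type.
          interface IterablePartition<K, D, R> extends Iterable<R>
          {
              K partitionKey();           // entity-level data, available without iterating
              D partitionLevelDeletion(); // entity-level data, available without iterating

              @Override
              Iterator<R> iterator();     // per-call iteration state, usable more than once
          }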

          I also think OpOrder.Group.close() does not belong in an iterator close. To minimize the potential for trouble, grouping/locking should be explicit in the user of the iterator/entity. For the instances where this is not feasible, perhaps the initiating methods could be called "openX" or "startX" vs "X" (e.g. openSearch vs search).

          Hide
          benedict Benedict added a comment -

          I also think OpOrder.Group.close() does not belong in an iterator close.

          This issue is actually much more problematic than I had realised. There are at least two places in the code already where we explicitly hold onto the OpOrder across operations of indeterminate length (disk IO, or peer query responses). During 2i rebuild we (I am told) also hold onto it for the entire duration of a single partition. During a normal read request, if we time out, as far as I can tell we don't even close the Iterator (so we already have a serious bug).

          OpOrder is explicitly not designed for any of these scenarios. Even without the bug, this can cause the entire cluster to lock up for a period because one node is down (and hasn't yet been marked such), or for a node to lock itself up because of either low disk throughput, or one of a rash of bugs we have had recently with tombstone bookkeeping causing heavy CPU consumption, for instance.

          As such I am now totally -1 on leaving OpOrder inside the iterator. Before 3.0 we need to ensure that we eagerly copy any contents we require from the memtable.

          Hide
          benedict Benedict added a comment -

          It occurs to me that another option, for the read path at least, would be to:

          1. In the case of a single data request (and it being local): immediately transform to the resultset, and store the digest for corroboration;
          2. In the case of a read-repair (or other multiple data requests), delay performing the local read operation until the remote replies have already arrived. This may marginally increase latency, but only on an uncommon codepath.
          Hide
          blerer Benjamin Lerer added a comment -

          Reviewed paging (except for the secondary index). It looks good to me.

          Hide
          beobal Sam Tunnicliffe added a comment -

          Just for the record, there is a bug with paging on secondary index queries, for which I have a patch. I'll make sure the relevant tests are all passing and submit a PR before the end of the week.

          Hide
          iamaleksey Aleksey Yeschenko added a comment -

          Higher level API for Thrift, coordination, counters, read commands is a go from me for the purposes of committing this to trunk (OpOrder issues, already mentioned elsewhere, aside).

          Will open separate tickets for minor issues, and for the pony lambda/streams proposal.

          Otherwise consider it a high-level +1-ish, and I suggest committing this to trunk as soon as backward compatibility code is done (without closing the ticket itself just yet).

          Hide
          iamaleksey Aleksey Yeschenko added a comment -

          Haven't looked deeply at schema/metadata changes, however, for CASSANDRA-6717 reasons.

          Hide
          benedict Benedict added a comment -

          I've pushed a small semantic-changing suggestion for serialization and merging of RTs here

          I'm happy to split this (and further changes) out into a separate ticket, but while this does cross the threshold for discussion/mention, it's actually a pretty small/contained change. Basically, on a RT boundary, instead of issuing a close and open marker, we just issue the new open marker - both during merge and serialization. On read, encountering an open marker when we already have one open for that iterator is treated as a close/open pair. This both reduces storage on disk, especially for large records (where RT markers are both more frequent and, obviously, larger), and gets rid of the UnfilteredRowIterators.MergedUnfiltered ugliness.

          Hide
          slebresne Sylvain Lebresne added a comment -

          Some update on this. I've pushed a rebased (and squashed because that made it a lot easier to rebase) version in my 8099 branch.

          It's still missing wire backward compatibility (Tyler Hobbs is finishing this so this should be ready hopefully soon). Regarding tests:

          • unit tests are almost green: mostly some failures remain in the Hadoop tests. I could actually use the experience of someone who knows these tests and the code involved, as it's not immediately clear to me what they are even doing.
          • dtests still have a fair amount of failures, but I've only started looking at them recently and the number is going down quickly.

          OpOrder

          I think the main problem was that a local read (done through SP.LocalReadRunnable) was potentially keeping a group "open" while waiting on other nodes. I also realized this path meant local reads (the actual read of sstables) were done outside of the StorageProxy methods, and so 1) not on the thread they were supposed to be on and 2) outside of the "timeout" check. I changed this so that a local response actually materializes everything upfront (similarly to what we do today), solving the problems above. This is not perfect and I'm sure we'll improve on this in the future, but that feels like a good enough option initially.
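          To sketch the idea (with hypothetical names: ReadGroup stands in for OpOrder.Group and the row iterator for the actual sstable/memtable read), the local result is copied out while the group is open, and the group is closed before anything of indeterminate duration, like waiting on remote replies, can happen:

          import java.util.ArrayList;
          import java.util.Iterator;
          import java.util.List;

          public class EagerLocalRead
          {
              // Stand-in for OpOrder.Group; assumed here to be AutoCloseable-like.
              interface ReadGroup extends AutoCloseable { @Override void close(); }

              static <T> List<T> localRead(ReadGroup group, Iterator<T> rows)
              {
                  try (ReadGroup g = group)
                  {
                      // Copy out everything we need while the group is still open ...
                      List<T> copy = new ArrayList<>();
                      rows.forEachRemaining(copy::add);
                      return copy;
                  }
                  // ... so the group is released here, before any wait on other nodes.
              }
          }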

          Regarding moving OpOrder out of close, the only way to do that I can see is to move it up the stack, in ReadCommandVerbHandler and SP.LocalReadRunnable (as suggested by Branimir above). I'm working on that (I just started and might not have the time to finish today, but it'll be done early Monday for sure).

          Branimir's review remarks

          I've integrated fixes for most of the remarks. I discuss the rest below.

          CompactionIterable 125: I doubt index update belongs here, as side effect of iteration. Ideally index should be collected, not updated.

          Though I don't disagree on principle, this is not different from how it's done currently (it's done the same way in LazilyCompactedRow; it just happens that the old LazilyCompactedRow has been merged into CompactionIterable (now CompactionIterator) because simplifications of the patch made it unnecessary to have separate classes). Happy to look at cleaning this up in a separate ticket however (it probably belongs with cleaning the 2ndary index API in practice).

          CompactionIterable 237: Another side effect of iteration that preferably should be handled by writer.

          Maybe, but it's not that simple. Merging (which is done directly by the CompactionIterator) gets rid of empty partitions and more generally we get rid of them as soon as possible. I think that it's the right thing to do as it's easier for the rest of the code, but that means we have to do invalidation in CompactionIterator. Of course, we could special case CompactionIterator to not remove empty partitions and do cache invalidation externally, but I'm not sure it would be cleaner overall (it would be somewhat more error-prone). Besides, I could argue that cache invalidation is very much a product of compaction and having it in CompactionIterator is not that bad.

          Validation compaction now uses CompactionIterable and thus has side effects (index & cache removal).

          I've fixed that, but I'll note for posterity that as far as I can tell, index removal is done for validation compaction on trunk (and all previous versions) due to the use of LazilyCompactedRow. I've still disabled it (for anything that wasn't a true compaction) because I think that's the right thing to do, but that's a behaviour change introduced by this ticket.

          add that there is never content between two corresponding tombstone markers on any iterator.

          That's mentioned in "Dealing with tombstones and shadowed cells". More precisely, that's what "it's part of the contract of an AtomIterator that it must not shadow its own data" means. But I need to clean up/update the guide, so I'll make sure to clarify further while at it.

          These objects should be Iterable instead. Having that would give clear separation between the iteration process and the entity-level data

          Yes, it would be cleaner from that standpoint. And the use of iterators in the first place is indeed largely carried over from the existing code; I just hadn't really thought of the alternative, tbh. I'll try to check next week how easy such a change is. That said, I'm not sure the use of iterators directly is that confusing either, so if it turns out hairy, I don't think it's worth blocking on this (that is, we can very well change that later).

          Hide
          slebresne Sylvain Lebresne added a comment -

          I've pushed a small semantic-changing suggestion for serialization and merging of RT

          Thanks. I hesitated doing this initially and don't remember why I didn't. But this does clean up things a bit so I'll look at integrating it on Monday unless I remember a good reason not to (which there probably isn't).

          Hide
          benedict Benedict added a comment -

          Yes, it would be cleaner from that standpoint. And the use of iterators in the first place is indeed largely carried from the existing code, I just hadn't really though of the alternative tbh. I'll try to check next week how easily such change is. That said, I'm not sure the use of iterators directly is that confusing either, so if it turns hairy, I don't think it's worth blocking on this (that is, we can very well change that later).

          It does change the semantics quite a bit, since the state needed for iterating must be constructed again each time, and is likely constructed in the caller of .iterator(). This has both advantages and disadvantages. One advantage of an Iterator, though, is that you cannot (easily) iterate over its contents twice. I'm personally not so upset at the use of Iterator, since it's a continuation of the existing approach, and Java 8 makes working with iterators a little easier. We can, for instance, make use of the forEachRemaining() method, or otherwise transform the iterator. I don't think there's any increased ugliness inherent in exposing the higher-level information in the Iterator, though.
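          As a tiny, plain-JDK illustration of that last point (nothing Cassandra-specific): forEachRemaining and a small transforming wrapper already cover much of what we tend to want from iterators.

          import java.util.Arrays;
          import java.util.Iterator;
          import java.util.function.Function;

          public class IteratorTransforms
          {
              // Lazily apply a function to each element of an iterator.
              static <A, B> Iterator<B> transform(Iterator<A> source, Function<A, B> fn)
              {
                  return new Iterator<B>()
                  {
                      @Override public boolean hasNext() { return source.hasNext(); }
                      @Override public B next() { return fn.apply(source.next()); }
                  };
              }

              public static void main(String[] args)
              {
                  Iterator<Integer> lengths = transform(Arrays.asList("a", "bb", "ccc").iterator(), String::length);
                  lengths.forEachRemaining(System.out::println); // prints 1, 2, 3
              }
          }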

          I believe Aleksey Yeschenko is working on a way to integrate the Java Streams API at some point in the future, which may lead to other benefits that Iterable cannot deliver.

          Either way, I think getting this ticket in sooner than later is better, and we can focus on how we might make the Iterator abstraction a little nicer in a follow up.

          Hide
          blambov Branimir Lambov added a comment -

          The changes look good, I accept that clarifying the side effects and switching to Iterable is not something that needs to be taken care of as part of this ticket.

          On the range tombstones doc, I'm sorry, I totally failed to explain what I had in mind. I was looking for a statement in ### Dealing with tombstones and shadowed cells that says an open marker is immediately followed by the corresponding close marker. This is a simple and easy to check statement which is equivalent to having both "does not shadow its own data" (pt 1) and "there is at most one active tombstone at any point" (pt 4).

          However, to clarify this I went and looked further into how tombstones work in the new code and after looking at it I do not think the merging code could work correctly. To make certain I wrote a partition merge test. The test results are very broken, for example:

          Merging
          6<=[34] <8 8 19<=[99] <=36 40 47 50<[56] <66 67 68 72<=[26] <=73 89<=[97] <97 
          2<=[66] <19 19 34<=[58] <=35 36 40<[94] <41 42<[26] <=48 55<=[35] <=57 58 83 88 
          5 8 19 31<=[49] <=31 37<=[70] <=44 46<[79] <55 65 72<[57] <85 92<[45] <93 93 
          results in
          2<=[66] <19 19<=[99] <=36 37<=[70] 40<[94] 41<=[70] 41<=[70] 44<=[26] 44<=[26] 46<[79] 55<=[56] 55<=[56] <66 67 68 72<=[26] 72<[57] <85 88 89<=[97] <97 
          java.lang.AssertionError: 40<[94] follows another open marker 37<=[70]
          java.lang.AssertionError: Duplicate marker 41<=[70]
          java.lang.AssertionError: Deletion time mismatch for position 44 expected:<deletedAt=70, localDeletion=70> but was:<deletedAt=26, localDeletion=26>
          
          Merging
          4 6 13<=[62] <=26 34<=[89] <=34 47 48<[99] <52 54<=[37] <=57 78<[6] <=83 85 88<=[34] <91 
          0 20 33<=[58] <=33 37<=[84] <40 52 57 77 77 88<[14] <91 92<[15] <96 
          1 8 31 41<[17] <43 62<=[25] <=67 85 87 88 92 97 
          results in
          0 1 4 6 8 13<=[62] <=26 31 33<=[58] <=33 34<=[89] <=34 37<=[84] <40 41<[17] <43 47 48<[99] <52 52 54<=[37] <=57 62<=[25] <=67 77 77 78<[6] <=83 87 88<=[34] <91 <91 92 92<[15] <96 97 
          java.lang.AssertionError: <91 should be preceded by open marker, found <91
          java.lang.AssertionError: Deletion time mismatch for open 88<=[34] and close <91 expected:<deletedAt=34, localDeletion=34> but was:<deletedAt=14, localDeletion=14>
          

          (where x<[z] or x<=[z] means a tombstone open marker at x with time z and <y or <=y stands for a close marker at y).

          This is also broken with Benedict's implicit close markers, and his patch is not sufficient to fix it. The test does not even go far enough, as it does not include rows that are within the span of a tombstone but newer (as far as I can tell such a row should expand into three Unfiltered's: a close marker, row with deletion time, and open marker).

          Am I somehow completely misunderstanding or missing some assumptions about the way tombstone ranges are supposed to work? Is there something very wrong in the way I'm doing this test?

          Hide
          slebresne Sylvain Lebresne added a comment -

          I'll look more closely at your test and fix any brokenness: it does seem the results are not what they are supposed to be.

          For the record however, I'll note that it's not true that "an open marker is immediately followed by the corresponding close marker", there can be some rows between an open and a close marker. However, the guarantee that iterators should provide is that those rows between an open and close marker are not deleted by the range tombstone (this doesn't make the test results above any more right, but I wanted to clarify).
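          Since that contract is easy to state but easy to get wrong, here is a sketch of the kind of check a test can run over a stream (hypothetical, heavily simplified types: a long timestamp per row and a deletion time per marker): while a range is open, every row encountered must be newer than the open deletion.

          import java.util.Arrays;
          import java.util.Iterator;

          public class IteratorContractCheck
          {
              // Simplified stand-ins for Unfiltered: a row carries a timestamp, a marker a deletion time.
              static class Unfiltered { final char kind; final long time; Unfiltered(char kind, long time) { this.kind = kind; this.time = time; } }
              static Unfiltered row(long ts)    { return new Unfiltered('R', ts); }
              static Unfiltered open(long del)  { return new Unfiltered('O', del); }
              static Unfiltered close(long del) { return new Unfiltered('C', del); }

              static void check(Iterator<Unfiltered> stream)
              {
                  Long openDeletion = null;
                  while (stream.hasNext())
                  {
                      Unfiltered u = stream.next();
                      if (u.kind == 'O')
                      {
                          assert openDeletion == null : "open marker while a range is already open";
                          openDeletion = u.time;
                      }
                      else if (u.kind == 'C')
                      {
                          assert openDeletion != null && openDeletion == u.time : "unmatched close marker";
                          openDeletion = null;
                      }
                      else if (openDeletion != null)
                      {
                          // the iterator must not shadow its own data
                          assert u.time > openDeletion : "row deleted by the iterator's own range tombstone";
                      }
                  }
                  assert openDeletion == null : "range left open at end of stream";
              }

              public static void main(String[] args)
              {
                  // run with -ea; this stream satisfies the contract
                  check(Arrays.asList(open(10), row(15), close(10), row(3)).iterator());
              }
          }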

          Hide
          blambov Branimir Lambov added a comment -

          it's not true that "an open marker is immediately followed by the corresponding close marker", there can be some rows between an open and a close marker. However, the guarantee that iterators should provide is that those rows between an open and close marker are not deleted by the range tombstone.

          It would be great to have this clarification in the doc. This is a point that I was missing; I will post a new version of the test that takes this into account.

          Hide
          blambov Branimir Lambov added a comment -

          I pushed the updated test in this branch, adding a test of more than one merge iteration. The branch also includes Benedict's implicit close markers patch and a small fix that makes it pass most of the validation. One problem left is that the merge produces duplicate open markers, which with the right interpretation could be acceptable after a single step, but will cause problems after multiple compaction iterations.

          Hide
          slebresne Sylvain Lebresne added a comment -

          As shown by Branimir's test, range tombstone merging was indeed broken and I've pushed the fix for that. I've included the old version of the test in question (with some ugly modifications so it tests the reverse case), but I'll look at updating it for the more generic version unless you want to have a shot at it.

          It would be great to have this clarification in the doc.

          Yes, sorry about that. I've tried to clarify and add to the comment in the code to start with. I'll update the "guide" to make that clearer when I have a bit more time.

          I've pushed a small semantic-changing suggestion for serialization and merging of RT

          I hesitated doing this initially and don't remember why I didn't

          I remember now.

          The problem is reverse queries: we need to be able to merge iterators in reverse order. And if we re-use a "start" marker as an "update" of the deletion, then when reversed and we see a "start" marker, we won't know if that's a real start or an update. Plus we do need both deletion times at a range boundary (the one of the range we left for reverse queries, and the one of the range we enter for forward ones). In all fairness though, I had forgotten to actually handle the reverse case in the merger, so I've fixed that, but said fix is trivial in the current approach.

          Now, there is obviously possible alternatives for dealing with RTs, but I feel that the current approach has a bunch of good properties:

          1. it's a simple model: every RT has a beginning followed by an end and that's it (no overlapping, no inclusion of ranges, very predictable).
          2. it works in both forward and reverse order, and in an obvious way.
          3. it makes the purging of gcable RTs trivial (you just blindly collect any gcable marker). This is something that was broken by the alternative patch in particular and would require some care.
          4. it reuses slice bounds without adding a new concept, which is nice I think.
            So while other options are up for discussion, I think there are enough parameters to consider that I'd prefer such potential discussion to happen in a separate ticket.

          I'll note that the "re-imaging" of markers at the beginning of index blocks is actually not necessary and something I forgot to remove. We now store in each index block whether there is an open marker at the end of that block (primarily so that we can decide, just from the index, whether an sstable has any data for a given slice), making this redundant. So I've removed it.

          Lastly, I fully agree that UnfilteredRowIterators.MergedUnfiltered is ugly and I meant it as a simple temporary solution: I'm sure we can find cleaner alternatives (maybe through some modification of the MergeIterator API so it's easy to produce more than one result for a given key).

          Hide
          benedict Benedict added a comment -

          I did had forgotten to actually handle the reverse case in the merger

          This is something that was broken by the alternative patch in particular and would require some care.

          Do we have test coverage of these? I didn't notice them breaking.

          From a nomenclature standpoint, I would like to suggest we relabel the marker types to (something like) UPPER/LOWER, so that internally we can refer to either as an open marker without confusion. Right now, the concept is blurred, because we treat close markers as open markers when operating in reverse, and IMO this hinders clarity.

          On the matter of the method for dealing with RTs, I agree with the majority of your points. However to improve the ugliness of MergedUnfiltered and remove the double persistence, why not introduce a special kind of RTM, of type BOUNDARY, which just has two timestamps. That's basically what the MergedUnfiltered is, so why not save space and improve clarity by promoting it to a first class concept? Unpicking them at merge when performing GC shouldn't be onerous.

          I'm dubious about introducing more method calls to be invoked on every Cell, to permit the rare case of two atoms after one merge result. That's a code complexity and execution cost incurred for the uncommon case, but paid by all.

          Hide
          slebresne Sylvain Lebresne added a comment -

          why not introduce a special kind of RTM, of type BOUNDARY

          Because that doesn't really fit the part about reusing "slice bounds without adding a new concept". A RT is a slice of deletion, and I think that's rather clean. I strongly suspect that either not reusing slice for RTs, adding a BOUNDARY concept to Slice.BOUND (which doesn't really make sense for slices per-se), or some other hack to work around this will make things more confusing/complex, and is thus not worth a minor optimization of a probably pretty rare situation in practice. Feel free to give the changes a shot though if you're convinced otherwise, and we can have a more informed discussion on the result.

          Do we have test coverage of these?

          We have generally poor coverage of range tombstone usage (though I've modified Branimir's test to test the reverse case too, so we have at least one test for that now). We have some basic tests, but nothing fancy enough. I've created CASSANDRA-9617 to improve this.

          I would like to suggest we relabel the marker types to (something like) UPPER/LOWER

          We can, though due to the point above that means also renaming it for slices in general. Which is not crazy per se, in that we do scan slices from end to start for reverse queries so they are equivalent in that respect; it's just that it's a departure from the existing nomenclature (which I don't personally mind).

          Hide
          blambov Branimir Lambov added a comment - - edited

          Pushed updated test. The code is much better now; the equivalence test passes even for larger tests. There's one more problem to fix, though: it generates invalid empty ranges violating the ordering of the iterator, for example:

          Seed 10
          Merging
          14<=[8] <15[8] 23 46 56<=[93] <69[93] 80 81 88<=[70] <90[70] 90 93<=[72] <=95[72] 97 
          10 12 33 37<=[71] <39[71] 39<=[8] <=44[8] 55<[14] <=57[14] 68 92<=[45] <98[45] 98<=[80] <=98[80] 99 
          2 8 24 33<=[7] <45[7] 51 62 63 63 79<=[71] <=79[71] 81<=[16] <=81[16] 
          results in
          2 8 10 12 14<=[8] <15[8] 23 24 33<=[7] 33 <37[7] 37<=[71] <39[71] 39<=[7] <39[7] 39<=[8] <=44[8] 44<[7] <45[7] 46 51 55<[14] <56[14] 56<=[93] 68 <69[93] 79<=[71] <=79[71] 80 81<=[16] <=81[16] 88<=[70] <90[70] 90 92<=[45] <93[45] 93<=[72] <=95[72] 95<[45] <98[45] 98<=[80] <=98[80] 99 
          java.lang.AssertionError: 39<=[7] not ordered before <39[7]
          
          Hide
          blambov Branimir Lambov added a comment -

          Thinking about this a bit more, I see that this is very difficult to fix. When the reducer issues a pair, one of the markers is out of its place in the stream; thus we would need to delay the stream to be able to place it correctly. This would have a non-trivial performance impact.

          Instead, I think we should officially permit this kind of disorder (e.g. <39[71] 39<=[7] <39[7] 39<=[8] from above where 39<=[7] <39[7] is invalid and covered by the outer pair of markers) in the unfiltered stream and remove the invalid ranges in the compaction writer. The merge algorithm is able to deal with such ranges correctly and filtering just removes them. We have to document it well and make sure the relevant code is tested with examples of this.

          Even without removal in the compaction writer, the only kind of trouble I can see such a range introducing is seeking to the wrong marker in a binary / indexed search, but this should be ok as the correct marker is certain to follow before any live data.

          Hide
          slebresne Sylvain Lebresne added a comment -

          You're right, it's not easy to fix at all. However, allowing such disorder "officially" crosses the boundaries of my comfort zone, even if it could work with the current code.

          So I'm starting to warm up to the idea of introducing new "boundary" types of marker (of which we'll need two: excl_close-incl_open and incl_close-excl_open). Provided we make sure iterators use those boundaries when appropriate, I think we dodge that problem. It will complicate things somewhat, but given the other benefits discussed above, maybe that's the better solution.
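          To make that concrete, here is a sketch of what such a boundary marker could carry (illustrative names and bare longs for deletion times, not the final types): it fuses a close and the matching open at the same position and keeps both deletion times, which is what keeps forward and reverse merging unambiguous.

          public final class BoundaryMarker
          {
              enum Kind { EXCL_CLOSE_INCL_OPEN, INCL_CLOSE_EXCL_OPEN }

              final Kind kind;
              final long closeDeletionTime; // deletion time of the range ending at this position
              final long openDeletionTime;  // deletion time of the range starting at this position

              BoundaryMarker(Kind kind, long closeDeletionTime, long openDeletionTime)
              {
                  this.kind = kind;
                  this.closeDeletionTime = closeDeletionTime;
                  this.openDeletionTime = openDeletionTime;
              }
          }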

          Hide
          blambov Branimir Lambov added a comment -

          Continuing my review with the db.filter package, apart from ColumnFilter.Expression.Serializer.serializedSize which appears to be missing a byte when the messaging version is 3.0, I see no correctness problems with the code. More unit testing would definitely help, though. Below are some suggestions that should not be treated as blocking commit.

          PartitionFilter:

          • The name implies it is something that filters partitions; in fact it applies to rows within a partition. I'd rename this to RowFilter or, if there's a reason to keep partition in its name, PartitionContentFilter.
          • forPaging javadoc is contradictory: on one hand it says strictly after, on the other it could be inclusive. Please clarify.
          • filter(UnfilteredRowIterator) is risky, people will call the wrong filter overload by mistake. Rename?
          • filter(SliceableUnfilteredRowIterator) does not seem to apply queriedColumns in either NamesPartitionFilter or SlicePartitionFilter. Is this on purpose? If it is, the method above is not its counterpart. Please document.

          NamesPartitionFilter:

          • clusterings should be NavigableSet to enable reverse traversal without duplication (TreeSet implements that too); see the small sketch after this comment. This simplifies other code in the file.
          • toString misses a closing bracket.

          SliceableUnfilteredRowIterator:

          • JavaDoc: what is 'seekTo'? 'slice'?

          ColumnSubselection:

          • ELT_SELECTION is not easily decipherable. Why not just ELEMENT instead? And 'element' elsewhere? (elt also)

          ColumnSelection:

          • includes/newTester: It appears these should only be called if the columns filter is already applied. This is not obvious; I'd state it in JavaDoc and/or assert that columns.contains(cell.column()).

          DataLimits:

          • DISTINCT_NONE is not the same as CQLLimits.distinct(MAX_VALUE). Would you add a comment to say why it is so?
          • countCells should be countsCells.
          • CQLLimits.isDistinct does not appear to be used for anything.

          Throughout, there are lots of easy-to-fix (just add <?>) raw type warnings. I would also use a virtual method instead of an instance field for kind everywhere, as done in ColumnSubselection.

          Here I've uploaded a branch that includes fixes to a few of the above.
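          As a tiny illustration of the NavigableSet point from the list above (plain JDK types only), the same sorted set serves both forward and reverse traversal without keeping a second, reversed copy:

          import java.util.NavigableSet;
          import java.util.TreeSet;

          public class ReverseTraversal
          {
              public static void main(String[] args)
              {
                  NavigableSet<String> clusterings = new TreeSet<>();
                  clusterings.add("a");
                  clusterings.add("b");
                  clusterings.add("c");

                  for (String c : clusterings)                 // forward order: a, b, c
                      System.out.println("forward: " + c);

                  for (String c : clusterings.descendingSet()) // reverse order: c, b, a
                      System.out.println("reverse: " + c);
              }
          }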

          Hide
          benedict Benedict added a comment -

          clusterings should be NavigableSet to enable reverse traversal without duplication (TreeSet implements that too). This simplifies other code in file.

          fyi, this change is included in CASSANDRA-9471, along with the necessary propagation of the type change outwards. No harm in doing it in advance of that, but it can also be left until then.

          benedict Benedict added a comment - - edited

          Sylvain Lebresne: what's the state of play with the refactor work? Is it being done in the near future? Trying to figure out if/when I should start making pull requests for the new memtable hierarchy.

          (If it isn't in progress, I'll see about starting the refactor myself and having you vet it instead)

          slebresne Sylvain Lebresne added a comment -

          I've started it and plan on focusing on it more exclusively this week. I'll add that I'm quite keen on giving it this first shot myself.

          benedict Benedict added a comment -

          That's all I needed to hear. Thanks!

          slebresne Sylvain Lebresne added a comment -

          I've force-pushed a rebased version of the branch (still here). Since my last update, on top of a number of fixes, I've finished moving the OpOrder out of the iterators' close and I've updated the range tombstone code to use specific boundary markers as discussed above (I've also included Branimir's branch with its "nits" and fixed most of the others). I haven't had the time to update Branimir's test, however, so for the sake of compilation I've removed it for now. If you could have a look at rebasing your test, Branimir Lambov, that would be greatly appreciated as you're more familiar with it.

          There is still a fair amount of work to be done on this ticket, but the bulk of it is reasonably stable, and outside of some of the backward compatibility code the branch is generally functional. We're also starting to have tickets that are based on this and are ready (or almost ready), tickets that won't be impacted too much by the remaining parts of this work (which include the refactoring of the flyweight-based implementation that I'm going to focus on now, the wire backward compatibility code Tyler is working on, and some general testing/bug fixing).

          So, based on some offline discussion, I suggest committing the current branch to trunk. I won't close this ticket just yet and will continue fixing the remaining things, but committing will allow other tickets to synchronize on this and will generally help get more eyes on it.

          I'm planning to commit this tomorrow-ish (my European tomorrow), so if you have a strong objection (again, we're not closing the ticket, and committing it doesn't mean it can't change), please speak up quickly.

          slebresne Sylvain Lebresne added a comment -

          As mentioned yesterday, I've merged the current branch onto trunk, so further development will happen there.

          PartitionFilter:

          • The name implies it is something that filters partitions, in fact it applies to rows within partition. I'd rename this to RowFilter or, if there's a reason to keep partition in its name, PartitionContentFilter.

          I agree that the naming isn't perfect here. It's actually a bit messier than that, in the sense that the current ColumnFilter really also filters rows and should be called RowFilter; the difference with PartitionFilter is that the latter is the part of the filtering that can be handled by the
          "clustering index". Anyway, I've cleaned this up through the following renames:

          1. PartitionFilter to ClusteringIndexFilter: I'm sure there are other options here, and if someone has a strong preference for an alternative I'm happy to discuss it, but I think it captures the point of the class pretty well.
          2. ColumnFilter to RowFilter, as this is the closest thing we have to a general filter on rows.
          3. ColumnSelection to ColumnFilter, since after all, that is our true filter on columns.
            I think this is a lot more consistent and precise.
          blambov Branimir Lambov added a comment -

          The rebased test is here. The test does not flag the disorder/duplication problem, but I'm afraid we still have it (see testDuplicateRangeCase).

          We need a bit more work to fully fix it: make complementing open/close/boundary markers compare equal rather than ordered, and make sure only one of them is present in each stream. I intend to take care of this within the next couple of days.
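
          To illustrate the intended change (a simplified sketch with hypothetical types, not the actual Cassandra marker comparator): complementing markers at the same clustering position compare equal rather than one before the other, so a merge sees a single spot and can keep exactly one marker per stream.

          {code:java}
          import java.util.Comparator;

          class MarkerOrderSketch
          {
              enum BoundKind { OPEN, CLOSE }

              // A hypothetical marker: a clustering position plus whether it opens or closes a range.
              static final class Marker
              {
                  final int position;
                  final BoundKind kind;

                  Marker(int position, BoundKind kind)
                  {
                      this.position = position;
                      this.kind = kind;
                  }
              }

              // Only the position takes part in the ordering, so a CLOSE marker and the OPEN marker
              // that complements it at the same position compare as equal; a merge then sees them at
              // a single spot and can collapse them into one boundary marker per stream.
              static final Comparator<Marker> BY_POSITION_ONLY = Comparator.comparingInt((Marker m) -> m.position);
          }
          {code}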

          slebresne Sylvain Lebresne added a comment -

          making complementing open/close/boundary markers equal rather than ordered and making sure there's only one present in each stream

          I was already partly doing that, but for some reason I hadn't made all the bounds that should be equal actually equal. Changing the last bits turned out to be reasonably simple, so I took the liberty of taking a quick shot at it (hopefully you haven't started working on it yourself yet). The result is here and it actually simplifies the merging somewhat. I've included your test on the branch and re-enabled the assertion that was failing; testDuplicateRangeCase passes, but I noticed 2 things with that test:

          1. it actually generates somewhat invalid inputs, in that it can produce duplicate rows. That is, the first "order violation" check of verifyValid should really use a strict inequality, but this currently breaks due to the duplicate rows (it doesn't invalidate the testing of range tombstones, but it would be nice to fix since the test is more of a general merge test).
          2. testInvalidRangeInput fails, but that seems expected to me. I suppose the test is meant to check that verifyValid does what it should?
          blambov Branimir Lambov added a comment -

          Excellent, the problem is now solved. An update of the test that removes the no-longer-necessary testInvalidRangeInput, avoids duplicate rows, and checks strict inequality is uploaded here, rebased on trunk.

          slebresne Sylvain Lebresne added a comment -

          Branimir Lambov I was about to commit the patch with your test, but it seems the last commit (the one with the test) includes a number of unrelated changes (probably from the rebase), and I'm not entirely comfortable committing those parts.

          blambov Branimir Lambov added a comment -

          Sorry, these changes were indeed wrong. Uploaded fix to the same branch.

          slebresne Sylvain Lebresne added a comment -

          Thanks for the update, Branimir Lambov. Since we're good on that, I've committed it to trunk, but from now on I'll stick to opening subtasks for things related to this ticket (though I'll keep this one open until we're done with all the subtasks).

          tzach tzach added a comment -

          It looks like this change undoes the fix for CASSANDRA-4858.
          Is this correct? Intentional?

          slebresne Sylvain Lebresne added a comment -

          It is certainly not the intention to undo CASSANDRA-4858, but I could use some more details on why you think it has been undone. A priori the code from that ticket, which basically merges ranges that can be merged together, is still there in StorageProxy.RangeMerger. So did you experimentally notice that this wasn't working as intended anymore? And if so, can you share your experiment?
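
          For readers following the reference, the idea from CASSANDRA-4858 that RangeMerger keeps alive can be sketched roughly like this (hypothetical types, a simplification rather than the actual StorageProxy code): consecutive sub-ranges served by the same replicas are folded into a single request.

          {code:java}
          import java.util.ArrayList;
          import java.util.List;
          import java.util.Set;

          class RangeMergeSketch
          {
              // A hypothetical sub-range of the ring together with the replicas responsible for it.
              static final class SubRange
              {
                  final long left, right;
                  final Set<String> replicas;

                  SubRange(long left, long right, Set<String> replicas)
                  {
                      this.left = left;
                      this.right = right;
                      this.replicas = replicas;
                  }
              }

              // Consecutive sub-ranges served by the same replicas are folded into one,
              // so fewer range requests need to be sent out.
              static List<SubRange> merge(List<SubRange> ranges)
              {
                  List<SubRange> merged = new ArrayList<>();
                  for (SubRange r : ranges)
                  {
                      SubRange last = merged.isEmpty() ? null : merged.get(merged.size() - 1);
                      if (last != null && last.right == r.left && last.replicas.equals(r.replicas))
                          merged.set(merged.size() - 1, new SubRange(last.left, r.right, last.replicas));
                      else
                          merged.add(r);
                  }
                  return merged;
              }
          }
          {code}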

          tzach tzach added a comment -

          Thanks Sylvain, I missed that.

          rustyrazorblade Jon Haddad added a comment - - edited

          Is there a spec, written in a single location, that describes the layout of the new sstable structure? Currently it feels like you have to jump between quite a few files to determine how everything is laid out.

          I've seen the guide here: https://github.com/apache/cassandra/blob/trunk/guide_8099.md#storage-format-on-disk-and-on-wire but it doesn't address at a byte level how things are stored. It doesn't have to be as intense as http://nmap.org/book/images/hdr/MJB-TCP-Header-800x564.png but I think to allow other people to make meaningful contributions it would be nice to have a doc describing things.

          pierz Pierre N. added a comment -

          Yes, it would be great to have a full specification now that Cassandra has a new sstable format.

          jbellis Jonathan Ellis added a comment -

          That's on Sylvain's list when he gets back in two weeks.

          iamaleksey Aleksey Yeschenko added a comment -

          The code's been in for a while now, and many of the issues have been resolved as separate tickets. Some issues and optimizations have been filed as separate tickets but are not resolved yet.

          As far as this parent ticket goes, with CASSANDRA-9704 inclusion, +1 from me.

          rajath26 Rajath Subramanyam added a comment - - edited

          Is there any specification online with a detailed description of the new SSTable format, byte by byte? It would be really useful. Right now, it seems like one has to go through multiple files to completely understand the new format.

          slebresne Sylvain Lebresne added a comment -

          Is there any specification online with a detailed description of the new SSTable format byte by byte ?

          There isn't, and as much as I would be the first to love seeing that exist, I doubt I'll find the time to provide it myself any time soon. Of course, it's open source, so anyone with some free time and willingness could use the code to kick-start such documentation on the wiki, which would be very much appreciated.


            People

            • Assignee:
              slebresne Sylvain Lebresne
              Reporter:
              slebresne Sylvain Lebresne
              Reviewer:
              Aleksey Yeschenko
            • Votes:
              3
              Watchers:
              76
