Details

    • Type: Improvement
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Fix Version/s: 4.x
    • Component/s: None

      Description

      Storage Format Proposal

      C* has come a long way over the past few years, and unfortunately our storage format hasn't kept pace with the data models we are now encouraging people to utilise. This ticket proposes a collection of storage primitives that can be combined to serve these data models more optimally.

      It would probably help to first state the data model at the most abstract level. We have a fixed three-tier structure: the partition key, the clustering columns, and the data columns. Each has its own characteristics and so requires its own specialised treatment.

      I should note that these changes will necessarily be delivered in stages, and that we will be making some assumptions about what the most useful features to support initially will be. Any features not supported will require sticking with the old format until we extend support to all C* functionality.

      Partition Key

      • This really has two components: the partition, and the value. Although the partition is primarily used to distribute across nodes, it can also be used to optimise lookups for a given key within a node
      • Generally partitioning is by hash, and for the moment I want to focus this ticket on the assumption that this is the case
      • Given this, it makes sense to optimise our storage format to permit O(1) searching of a given partition. It may be possible to achieve this with little overhead based on the fact we store the hashes in order and know they are approximately randomly distributed, as this effectively forms an immutable contiguous split-ordered list (see Shalev/Shavit, or CASSANDRA-7282), so we only need to store an amount of data based on how imperfectly distributed the hashes are, or at worst a single value per block.
      • This should completely obviate the need for a separate key-cache, which will be relegated to supporting the old storage format only
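      As a rough illustration of the O(1) lookup idea (a hypothetical sketch, not C* code; `HashIndex` and `lookup` are invented names, and non-negative hashes are assumed): with approximately uniformly distributed hashes stored in order, we can interpolate an expected position and walk locally, so the expected number of steps is constant:

```java
// Hypothetical sketch: near-O(1) lookup in a sorted array of approximately
// uniformly distributed 64-bit hashes, by interpolating the expected position.
// Assumes hashes lie in [0, Long.MAX_VALUE); names are illustrative only.
public final class HashIndex {
    private final long[] sortedHashes; // ascending, approx. uniform

    public HashIndex(long[] sortedHashes) {
        this.sortedHashes = sortedHashes;
    }

    /** Returns the index of hash, or -1 if absent. */
    public int lookup(long hash) {
        int n = sortedHashes.length;
        if (n == 0) return -1;
        // Interpolate the expected position assuming a uniform distribution.
        int guess = (int) ((double) hash / (double) Long.MAX_VALUE * (n - 1));
        int i = Math.min(Math.max(guess, 0), n - 1);
        // Walk locally; with near-uniform hashes this is O(1) expected steps.
        while (i > 0 && sortedHashes[i] > hash) i--;
        while (i < n - 1 && sortedHashes[i] < hash) i++;
        return sortedHashes[i] == hash ? i : -1;
    }
}
```

      In the imperfect case the walk degrades gracefully: we only pay extra steps proportional to how far the actual distribution deviates from uniform, which matches the "store only an amount of data based on how imperfectly distributed the hashes are" observation above.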

      Primary Key / Clustering Columns

      • Given we have a hierarchical data model, I propose the use of a cache-oblivious trie
      • The main advantage of the trie is that it is extremely compact and supports optimally efficient merges with other tries so that we can support more efficient reads when multiple sstables are touched
      • The trie will be preceded by a small amount of related data: the full partition key, a timestamp epoch (for offset-encoding timestamps) and any other partition-level optimisation data, such as (potentially) a min/max timestamp to abort merges earlier
      • Initially I propose to limit the trie to byte-order comparable data types only (the number of which we can expand through translations of the important types that are not currently)
      • Crucially, the trie will also encapsulate any range tombstones, so that these are merged early in the process and we avoid re-iterating the same data
      • This results in true bidirectional streaming without having to read an entire range into memory
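      To illustrate what "byte-order comparable" means here (a hypothetical sketch; `ByteComparable` is an invented name, not the ticket's actual encoding): a signed integer can be made byte-order comparable by flipping its sign bit, after which an unsigned lexicographic comparison of its big-endian bytes, which is all a trie or memcmp can do, agrees with signed numeric order:

```java
// Hypothetical sketch of a byte-order comparable translation for a signed int.
// Flipping the sign bit maps signed order onto unsigned order, so comparing
// the 4 big-endian bytes lexicographically matches Integer.compare.
public final class ByteComparable {
    public static byte[] encodeInt(int v) {
        int flipped = v ^ Integer.MIN_VALUE; // sign-bit flip
        return new byte[] {
            (byte) (flipped >>> 24),
            (byte) (flipped >>> 16),
            (byte) (flipped >>> 8),
            (byte) flipped
        };
    }

    /** Unsigned lexicographic comparison, as a trie or memcmp would see it. */
    public static int compare(byte[] a, byte[] b) {
        for (int i = 0; i < a.length; i++) {
            int d = (a[i] & 0xFF) - (b[i] & 0xFF);
            if (d != 0) return d;
        }
        return 0;
    }
}
```

      Similar translations exist for the other important types (e.g. reordering timeuuid fields so the time bits lead), which is what "expand through translations" refers to above.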

      Values

      There are generally two approaches to storing rows of data: columnar or row-oriented. The two data structures above can be combined with a value storage scheme based on either. However, given our current model of reading large 64KB blocks for any read, I am inclined to focus on columnar support first, as this delivers order-of-magnitude benefits to users with the right workload, while for most workloads our 64KB blocks are large enough to store row-oriented data in a column-oriented fashion without any performance degradation (I'm happy to consign very-large-row support to phase 2).

      Since we will most likely target both behaviours eventually, I am currently inclined to suggest that static columns, sets and maps be targeted for a row-oriented release, as they don't naturally fit in a columnar layout without secondary heap-blocks. This may be easier than delivering heap-blocks also, as it keeps both implementations relatively clean. This is certainly open to debate, and I have no doubt there will be conflicting opinions here.

      Focusing on our columnar layout, the goals are:

      • Support sparse and dense column storage
      • Efficient compression of tombstones, timestamps, ttls, etc. into near-zero space based on offset/delta/bitmap encoding
      • Normalisation of column names once per file
      • Per-file block-layout index, defining how each block's data is encoded, so we can index directly within a block for dense fields (permitting more efficient page cache utilisation)
      • Configurable grouping of fields per block, so that we can get closer to row-oriented or column-oriented performance, based on user goals
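      As a minimal sketch of the offset/delta encoding idea (invented names; a real implementation would pack the deltas as varints and use bitmaps for tombstone/ttl presence): storing one epoch per partition and per-cell deltas means that the common case of identical or clustered timestamps collapses to near-zero space:

```java
// Hypothetical sketch of offset/delta timestamp encoding. One epoch is stored
// once; each cell stores only its delta from that epoch, which is small and
// varint-friendly. Names are illustrative, not C* code.
public final class TimestampCodec {
    /** encoded[0] is the epoch (minimum timestamp); the rest are deltas. */
    public static long[] encode(long[] timestamps) {
        long epoch = Long.MAX_VALUE;
        for (long t : timestamps) epoch = Math.min(epoch, t);
        long[] out = new long[timestamps.length + 1];
        out[0] = epoch;
        for (int i = 0; i < timestamps.length; i++)
            out[i + 1] = timestamps[i] - epoch; // small deltas; zero for runs
        return out;
    }

    public static long[] decode(long[] encoded) {
        long[] out = new long[encoded.length - 1];
        for (int i = 0; i < out.length; i++)
            out[i] = encoded[0] + encoded[i + 1];
        return out;
    }
}
```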

      I have attached my slides from the ngcc for reference.

      1. ngcc-storage.odp
        705 kB
        Benedict
      2. storage_format.pdf
        359 kB
        Benedict

        Issue Links

          Activity

          snazy Robert Stupp added a comment -

          Some notes from my side (I do not have in-depth knowledge of how sstables work - so maybe some points are foolish):

          • CASSANDRA-7443 is good to clean up the code base and should be done first
          • Using 32bit instead of 64bit pointers could also save some space. Two variants:
            • "Byte granular" 32bit pointers for files up to 4G or
            • "compressed" 32bit pointers (basically what Java's CompressedOOPS does) if we know that blocks can only begin at certain offsets - would move the "32bit barrier" to 64g files for 16 byte boundaries
          • Trie + byte ordered types: would this mean to do some "special serialization" e.g. for timeuuid to make them "binary comparable"?
          • If one partition only contains one row, "plain" row-oriented storage seems to be more efficient. Is this what "small partition layout" is meant for?
          • Column names (CQL): I'd prefer to extend the table definition schema with an integer column id and use that. Could save lots of String.hashCode/equals() calls - even if the column-id is also used in native protocol. (Think this was discussed elsewhere)
          • Bikeshed: Is the term "sstable" still correct?
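          The compressed-pointer arithmetic suggested above could look like this (an illustrative sketch, not C* code; `CompressedPointer` is an invented name): with 16-byte block alignment, a 32-bit pointer stores `offset >> 4`, covering files up to 64 GB:

```java
// Sketch of "compressed" 32-bit file pointers, analogous to CompressedOOPS:
// if blocks start only on 16-byte boundaries, the low 4 bits of an offset are
// always zero and can be dropped, extending the 32-bit range to 64 GB.
public final class CompressedPointer {
    static final int SHIFT = 4; // 16-byte block alignment

    public static int compress(long fileOffset) {
        assert (fileOffset & ((1L << SHIFT) - 1)) == 0 : "offset must be 16-byte aligned";
        long c = fileOffset >>> SHIFT;
        assert c <= 0xFFFFFFFFL : "file larger than 64 GB";
        return (int) c;
    }

    public static long decompress(int pointer) {
        return (pointer & 0xFFFFFFFFL) << SHIFT; // unsigned widening, then re-align
    }
}
```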

          I didn't catch why only maps and sets don't naturally fit into columnar format but lists, strings and blobs do. Or is it just because of their mean serialized size?

          If I understood the columnar concept correctly, it works best if a whole partition fits in one (64k) block. The more additional (heap-)blocks one partition needs, the more efficient the row-oriented format is. Correct?

          benedict Benedict added a comment -

          CASSANDRA-7443 is good to clean up the code base and should be done first

          This, and further related patches, are a necessary prerequisite to this ticket, yes.

          Using 32bit instead of 64bit pointers could also save some space.

          I would prefer not to go down this route just yet, as it is error prone to be optimising this in the first version. Any optimisations that can be made universally (i.e. guaranteed to be safe for all file sizes) I'm onboard with, but obfuscating code dependent on file size I'm not. Especially as this introduces an extra condition to execute on every single field access, potentially stalling the processor pipeline more readily.

          Trie + byte ordered types: would this mean to do some "special serialization" e.g. for timeuuid to make them "binary comparable"?

          Yes

          If one partition only contains one row, "plain" row-oriented storage seems to be more efficient. Is this what "small partition layout" is meant for?

          No, it is because it requires fewer disk accesses to have it all packed into the same block (or we can have smaller blocks, increasing IOPS esp. on SSDs). In fact it is quite reasonable to assume that even with single row partitions the column oriented storage will be more efficient, as the columns do not care about partitions; they extend across all partitions, and so the serialization costs are reduced even if there are no clustering columns.

          I should note that the presentation at ngcc is only for historical reference and to get familiar with the general discussion. As mentioned in the description of this ticket, I now favour a row-oriented approach backed by the new index structures for many of the non-optimal column-oriented use cases, which may reduce the necessity of a compact column-oriented form, although it would still be useful as just described.

          Column names (CQL): I'd prefer to extend the table definition schema with an integer column id and use that. Could save lots of String.hashCode/equals() calls - even if the column-id is also used in native protocol. (Think this was discussed elsewhere)

          There is a separate ticket for this, and I consider it to be an orthogonal effort. We can more easily deliver it here than we can cross-cluster (personally I favour cross-cluster names to be supported by a general enum type (CASSANDRA-6917)).

          Bikeshed: Is the term "sstable" still correct?

          The original sstable was only imposing a sort-order on the partition keys. This will still be imposed, so yes, but I don't have any strong attachment to it.

          I didn't catch the point why only maps and sets don't naturally fit into columnar format but lists, strings and blobs do. Or is it just because of their mean serialized size?

          They don't logically fit because they are an extra dimension, much as static columns are one fewer dimension. Columnar layouts really need fixed dimensionality. You can flatten maps, sets and lists (my list was not exhaustive), but this incurs significant cost and complexity on reading these across multiple sstables, as opposed to relying on the standard machinery. Strings and blobs can more trivially be split out into an extra file if they are too large (for simplicity of first delivery we can just append all values larger than some limit to a file, and replace them with their location in the file), but storing large strings in a columnar layout is generally not sensible/beneficial anyway.
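          The "append values larger than some limit to a file" scheme might be sketched like this (hypothetical names; an in-memory buffer stands in for a real side file): oversized values go to a separate stream and are replaced inline with a tagged (offset, length) reference:

```java
// Illustrative sketch of a side file for large values: values above a size
// limit are appended to a separate stream and replaced inline by a 1-byte tag
// plus an 8-byte offset and 4-byte length. Names are hypothetical.
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.ByteBuffer;

public final class LargeValueWriter {
    static final int LIMIT = 64; // threshold; a real limit would be configurable

    private final ByteArrayOutputStream sideFile = new ByteArrayOutputStream();

    /** Returns the bytes to store inline: the value itself (tag 0), or a
     *  13-byte (tag 1, offset, length) reference into the side file. */
    public byte[] writeValue(byte[] value) throws IOException {
        if (value.length <= LIMIT) {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            out.write(0);
            out.write(value);
            return out.toByteArray();
        }
        long offset = sideFile.size();
        sideFile.write(value);
        ByteBuffer ref = ByteBuffer.allocate(13);
        ref.put((byte) 1).putLong(offset).putInt(value.length);
        return ref.array();
    }
}
```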

          In all likelihood I think the best approach may be to permit collections and statics on column oriented tables by splitting them into a separate row-oriented sstable, at least in the near-term. The heap-blocks outlined in the ngcc talk could be delivered later, although I might be inclined to tell users that column oriented storage is not for them if they want to store these things in the table.

          tjake T Jake Luciani added a comment -

          Is there any reason why you want to put the row index block next to the data? This actually makes it tricky to make sstables pluggable since right now we would put this index in the index.db file. It could be in both places I suppose since it would help with recovery to have multiple copies.

          Also, if you plan on putting the index at the front of the row, you would need to do some kind of two-pass write of the partition. It would be better to place it at the end, since for huge partitions the two passes would cause issues (we don't do two-pass compactions anymore)

          benedict Benedict added a comment -

          Is there any reason why you want to put the row index block next to the data?

          If we are going out of cache, we may as well read the index + data together, rather than the index then the data. With HDDs this should avoid any penalty. Bear in mind the index also is the data in this brave new world. My goal with the new format is to, as far as possible, guarantee as many or fewer seeks than the old format (even if SSDs are becoming more prevalent), whilst reducing the total amount of space necessary (to reduce requisite disk bandwidth and improve cache occupancy).

          Is there any reason why you want to put the row index block next to the data? This actually makes it tricky to make sstables pluggable since right now we would put this index in the index.db file. It could be in both places I suppose since it would help with recovery to have multiple copies.

          Why does this make pluggability hard? The index is an artefact of the sstable type (or it should be, before we roll this out), so it shouldn't matter?

          Also if you plan of putting the index at the front of the row you would need to do some kind of two pass to write the partition.

          Maybe. I'd prefer not to get down to this level of specifics just yet, I'm pretty sure it's solvable either way. It would be preferable to focus mostly on the overall design, featureset, etc. for the moment. The format is likely to be agnostic to where the two records live with respect to each other, but there are some optimisations possible on read if they're adjacent, assuming the records are all smaller than a page. If they are much larger than that, no optimisation is likely to help so it doesn't matter too much, and if they are smaller we only have to buffer two pages.

          tjake T Jake Luciani added a comment - - edited

          Why does this make pluggability hard? The index is an artifact of the sstable type (or it should be, before we roll this out), so it shouldn't matter?

          You are right, but it makes things simpler since you already have all the code in place to put the index next to the partition location. In the end the only thing required will be the file offset to the start of the partition's data (which can be your index)

          benedict Benedict added a comment -

          you already have all the code in place to put the index next to the partition location

          I'm not sure I follow, but one of the goals is to permit faster in memory (cached) performance, which means being more targeted with where we hit data inside our pages so that we can cache with finer granularity (and so have a higher cache hit rate), so we don't want to scan entire pages if we can avoid it.

          We have one index right now, and one data file, within both of which we persist clustering keys. This new scheme has one partition index, one data file, and one hybrid dataset, which can live by itself or in the datafile, but behaves as both a clustering index and the data itself. So when we're talking about an index things can get confusing. However we want to be able to support (and improve upon) the current ability to seek directly within partitions, and we want to be able to do so efficiently, without extra disk seeks, so we ideally want these clustering key records to be cached independently of the rest of the data since they will/may be referenced more frequently.

          slebresne Sylvain Lebresne added a comment -

          Any features not supported will require sticking with the old format until we extend support to all C* functionality.

          I'm not a big fan of that part, tbh. I would have a strong preference for introducing a file format that is better than our current one (not very hard) but is generic enough to cover everything, even if this means that it's not as optimized for some workloads as it could possibly be. Which probably means a row-oriented approach first. We can then optimize some special cases later.

          One reason is that I'd prefer not having to keep/maintain the existing file format for multiple versions, but more generally, it feels saner to me to get one format reasonably performant first, and then add special cases/variants. This would also delay the debate on how much we're willing to specialise the sstable format for different workloads, a debate on which I'm not sure we all have the same opinion, and one that is currently somewhat clouded by the fact that the existing format is really rather inefficient.

          benedict Benedict added a comment -

          The goal of the new file design is that it is sufficiently modular that modifying it to support a row-oriented approach is not very hard. However the magnitude of performance differential will not be as dramatic, so I prefer to deliver the dramatic benefits first.

          Supporting ALL current users is going to be a bit more involved, as I cut from the scope of the initial delivery support for non-byte-ordered types (excluding those we can transform to byte-ordered, which includes all CQL types I'm reasonably confident). This is because supporting non-byte-ordered types not only severely complicates the delivery of the intra-partition indexing, it also makes it suboptimal once delivered. My preference would be to certainly not deliver support for this immediately, but possibly not to deliver at all until we receive a lot of evidence it's necessary from users asking for it when looking to upgrade, since encouraging its obsolescence is a positive step for simplifying and maintaining future performance characteristics. Either way, it should not be the default due to its negative implications.

          In the next week or so I should be posting the outline of the next-gen container type, which will loosely be comprised of the three components outlined above, however defined in such a way that any of the components can be switched to yield different characteristics or support different workloads.

          In summary, we disagree on multiple levels; however, columnar vs row-oriented delivery is not the most important disagreement. Delivering a row-oriented solution will use largely the same components as a columnar approach and, assuming there is time before release, it's quite achievable to deliver both simultaneously. Supporting all current use cases in the new file format is a great deal more involved, however. There is certainly a great deal of benefit, either way, to fleshing out more of the details of the new-and-improved approaches, since the subtleties will not be as clear to us in advance as they are when thinking about/discussing the existing state of affairs. So simply rewriting the existing state of affairs seems to me to run the slight risk of yielding a piece of work that is not trivially adapted to support the more optimal use cases.

          slebresne Sylvain Lebresne added a comment -

          So we disagree. I'll try to detail a bit more clearly what I disagree with and explain how I would rather proceed. Others may then chime in with their opinions (and if it turns out my objections are mine alone, I'll happily yield to the majority).

          You propose as the ultimate goal to have not just one "static" file format, but rather a somewhat modular file format with specialized variants to handle some types of workload/tables. I have no problem with that ultimate goal (provided things are well enough thought out that it's not a completely unmaintainable mess for no big benefit, but I suppose we agree on that).

          Where I disagree however is how we proceed to get there and where we start. My understanding is that you're suggesting a first version that is pretty specialized:

          • it would only work for tables whose comparator is byte-ordered. Today, that's text/ascii, blob and inet, and that's it. There's a plan (CASSANDRA-6936) for transforming values on the fly for the other types to make them byte-ordered, but that is still a second step, and one that is not all that clear for all types.
          • it would exclude "static columns, sets and maps". I suspect this would also include nested collections and UDTs "done right", which we plan to introduce in 3.0 too.
            Now, I understand that you're not saying that the file-format would never lift those limitations, but they would still be initial limitations of the first iteration. And there is no clear plan on when the rest will be delivered. In particular, it means we'll have to maintain the old file format for an indeterminate amount of time (some of your comments even suggest that supporting all tables will not be a very high priority at all ("not deliver support for this immediately, but possibly not to deliver at all until we receive a lot of evidence it's necessary" or "3.0 storage engine won't be universal for a while (maybe never for thrift)" in CASSANDRA-6276)).
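As a concrete illustration of the kind of on-the-fly transformation CASSANDRA-6936 is after: a signed integer can be made byte-ordered simply by flipping its sign bit before a big-endian encoding, so that unsigned lexicographic comparison of the encoded bytes matches numeric order. This is an editorial sketch in Python, not Cassandra's actual code:

```python
import struct

def int32_to_byte_ordered(value: int) -> bytes:
    """Encode a signed 32-bit int so that lexicographic comparison of
    the encodings matches numeric order. Adding 2^31 (i.e. flipping the
    sign bit) maps [-2^31, 2^31) monotonically onto [0, 2^32)."""
    return struct.pack(">I", (value + (1 << 31)) & 0xFFFFFFFF)

# Lexicographic order of the encodings matches numeric order:
values = [-5, -1, 0, 1, 42]
encoded = [int32_to_byte_ordered(v) for v in values]
assert encoded == sorted(encoded)
```

Analogous (if fiddlier) monotonic encodings exist for the other fixed CQL types, which is what makes the "all CQL types can be transformed" claim above plausible; variable-length and composite types need more care.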

          What I would suggest instead would be this:

          1. Deliver that modular file-format, but with, for each component, something that can handle everything we have. I claim that it's not too hard to do that and still have this first version be a net win over the current format (because that's a low bar).
          2. Optimize specific components for specific table structure/workload.

          Doing it in this order has the following advantages imo:

          • we stop relying on the old format right away. We don't have to maintain legacy code for an indeterminate time.
          • we can discuss the details of each specific optimization more in isolation.
          • we can evaluate each optimization from a much saner base.

          This is because supporting non-byte-ordered types not only severely complicates the delivery of the intra-partition indexing

          I'm not sure why that is. We have an existing intra-partition indexing that perfectly supports non-byte-ordered types, and it's pretty simple (better, we even have the code already). Can we do better? Certainly, but that doesn't mean that we have to change everything right away. Making the file format more modular and making it more compact (by at least (but not limited to) avoiding the repetition of the clustering columns for a given CQL row, using a compact encoding of CQL column names, and using offset-encoding for timestamps) would certainly be a good start in my book, one that would not be too big (a feature), and one on which we can build.
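To make the offset-encoding idea concrete: cell timestamps within a partition tend to cluster tightly, so storing one base value per run plus small deltas is far more compact than eight bytes per cell. A hedged sketch of the principle (illustrative only; function names and the exact layout are invented, not the proposed wire format):

```python
def encode_timestamps(timestamps):
    """Encode a run of cell timestamps as a base plus per-cell offsets.
    Because timestamps in a partition cluster together, the offsets
    need far fewer bytes than eight-byte absolute values."""
    base = min(timestamps)
    return base, [t - base for t in timestamps]

def decode_timestamps(base, offsets):
    """Recover the original timestamps from base + offsets."""
    return [base + o for o in offsets]

# Microsecond-precision timestamps written close together:
ts = [1400000000123456, 1400000000123999, 1400000000124001]
base, offs = encode_timestamps(ts)
assert decode_timestamps(base, offs) == ts
assert max(offs) < 1 << 16  # these deltas fit in two bytes, not eight
```

A real encoding would presumably varint-pack the offsets per block; the point is only that the delta representation is lossless and much smaller.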

          benedict Benedict added a comment -

          I think we're conflating here what will make it into GA and what will be delivered first. I want to deliver not a specialised format, but a lightly modular format from which comparatively small pieces of work allow us to expand its power and applicability, and I want to deliver the most common and interesting (and most demonstrably superior) uses first. I'm happy to debate what we aim to release simultaneously with a GA release. My goal is to deliver, in order (and hopefully all before GA):

          byte-ordered composite column indexing -> row/columnar hybrid (i.e. a sub-square of a given grid in any page on disk), pages packed row-orientedly
          byte-ordered composite column indexing -> columnar (i.e. configurable one or more columns per page), pages packed column-oriented
          byte-ordered composite column indexing -> row oriented, nesting of statics, collections, UDTs etc

          Note that supporting statics, collections, UDTs etc with columnar is as simple as splitting them out into a row-oriented (old or new) format, so it does not inherently limit the scope of their applicability.

          It's quite possible we can deliver all of this for GA of 3.0; however, I want to focus on the biggest wins first. Now, whilst one of the main driving forces for the new format is how inherently inefficient the current indexing (which, as you mention, we already have and could reuse) is, we could relatively easily migrate it to the new format if we wanted, or at least a version with some of the new algorithmic benefits but similar overall algorithmic and space complexity. However, I don't see a tremendous benefit to this beyond leaving users with the old format until we can deliver the maximal algorithmic and space complexity benefits of an index that supports both non-byte-ordered and byte-ordered data simultaneously and optimally.

          More importantly, as I previously stated and you do not appear to have responded to directly, there are major benefits to eliminating non-byte-ordered comparators from the codebase entirely, and there is limited evidence that they are widespread outside of the C* types we support that do not currently have a byte-ordered representation. The work to translate these is comparatively straightforward, although not trivial, and yields its own benefit of relying on more efficient data structures and code paths to improve efficiency for anyone using these types, regardless of our support for non-byte-ordered types. However, once we've done this, given the limited evidence of their wide use outside of these types, I have a very strong preference for creating some pressure to eliminate them entirely from the ecosystem, only letting up if the pushback is sufficient.

          So for this reason alone I prefer to not deliver indexing supporting this.

          However, if that is not generally supported as a position, we could work in tandem with other work to produce a simpler indexing component that has no better algorithmic or space complexity than the current indexing scheme and support it. But I would at the very least want to see it turned off by default, and would rather only allocate resources to it once we know we're ahead of the game for the big release points for 3.0.

          If, however, we wanted to optimally and generally support non-byte-ordered comparators (I can't quite make out what your position is on this), we need to focus on a hybrid data structure which appears at the moment to be significantly more complex than either the considerably more optimal byte-ordered data structure or a regular non-byte-ordered (only) structure. I definitely want to leave work on something like this until after we've delivered our big wins, and preferably until we have strong evidence it will be widely useful.

          benedict Benedict added a comment -

          Since the prior comment was very verbose, I want to clarify my goals and approach a little independently.

          There are a number of major puzzle pieces the end-state of this general push entails, and we want to deliver it incrementally. As such, my main desire with a plan of attack is to deliver the minimal components necessary to deliver some of the true final-state functionality, i.e. components that are themselves complete, but that are also well placed to be built out from (or adjacent to) to fill in the remaining end-state goals. My desire is to plan the puzzle pieces so that delivering a majority of functionality before GA is possible (though probably not guaranteed), and so that there are many fully functional pre-end-states en route. This is optimally done without producing redundant work.

          I also want to slightly sharpen and narrow the focus of the storage engine so that it can not only be more efficient, but can also be maintained and expanded more efficiently. These two goals are strongly aided by dropping support for non-byte-ordered clustering types. A last-minute, pressure-induced suboptimal solution for these use cases would NOT be a tremendously complex task, however, so there doesn't seem to be much lost in delivering this later.

          I've been considering various modifications to the approach outlined so far, and there are some fairly significant (though not earth-shattering) changes I will outline in a follow-up comment, when I'm in a better position to do so (hopefully very soon). It's quite possible, given some thought, that with the changes I intend to outline I would be quite comfortable delivering a row-oriented approach first. However, even this would be delivered incrementally, and would most likely not support collections and some other complex features we're introducing with UDTs in 3.0 at first blush. This would introduce approximately zero extra work to deliver it in stages, but permits much greater QA work and easier review, integration and development.

          slebresne Sylvain Lebresne added a comment -

          I do not want to maintain the old file format and its related code for an indeterminate amount of time, and in fact, I don't see any reason to risk not having the new format obsolete the old one right away. And there are enough things to improve with the current file format that I'm absolutely convinced we can start with a version that supports everything and yet improves on the existing one, while still not limiting further, more specific, optimizations. Given that this is the case, it seems obviously the right place to start to me, and so, unless the majority disagrees with me, I'm opposed to doing it any other way. If that means the more dramatic changes don't come first, that's fine by me; I do not think that prioritizing specific workloads first is a good idea, and "because Benedict wants to be sure to have his dramatic improvement in 3.0" is not a very strong argument in my book.

          On the issue of byte-ordered comparators, I'm not against having components optimized for them, but I'm a very strong -1 on not supporting non-byte-ordered ones at all. They may not be widespread, but they are used and have their uses. And the fact that we're having this kind of debate when discussing the very first iteration of a new file format is what makes me convinced that we're doing it wrong.

          I've understood what you prefer, but I disagree that this is the proper order for doing it, and more precisely I strongly think that we should start with a version that supports everything. I'm completely on board with the idea of a more modular file format, and I'm good in general with the optimizations proposed here, as long as they are done as optimizations in a second step (which would also allow a more focused debate on the details of each of those more specialized optimizations).

          At this point, I'm interested in getting other opinions.

          benedict Benedict added a comment -

          At this point, I'm interested in getting other opinions.

          Agreed. Although you seem to have brushed over my point about delivering this project incrementally. Picking easier subsections of functionality that lie on an approximately linear path to delivery of all functionality makes the work more digestible, testable and debatable. It also happens to allow us to guarantee delivery of high-quality functionality in time for 3.0, even if it's possibly only a subset (though there is a good chance we can hit all of it), instead of delivering fairly average functionality that covers all users.

          As I also tried to make clear, if your goal is to eliminate the old format entirely and this is widely supported, then once we have delivered the first tranche of components we can relatively trivially deliver a suboptimal addendum to support all of the old features (with only moderate benefit over the legacy format). However, even if the goal of eliminating the legacy format is decided upon, it seems best to leave this until much nearer GA, since if we can deliver the full featureset before GA this is simply wasted effort.

          slebresne Sylvain Lebresne added a comment -

          you seem to have brushed over my point about delivering this project incrementally

          I'm completely on board with doing it incrementally. I'm just not sure what is intrinsically less incremental in what I'm suggesting. We both agree that we want a modular file-format; we just disagree on which components we should start with. I'm saying we should start with generic components with no more ambition than to be no less efficient than what we have currently (but let's be clear that it's not very hard to improve on at least: avoiding repeating clustering columns, using a compact encoding for CQL column names, and using offset-encoding for timestamps, all of which are part of what you are suggesting for the first iteration anyway). You want to start with specialized components that are somewhat optimal for some workloads. I claim that there is a lot more to debate in what you're suggesting, and I certainly dispute the claim that what I'm suggesting is less digestible and testable (I would in fact argue the contrary).
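The clustering-column elision both sides agree on is easy to picture: in the current format, every cell of a CQL row repeats the row's full clustering prefix, whereas the new format would write it once per row. An illustrative sketch, with invented names and shapes (the real format operates on serialized cells, not tuples):

```python
def elide_clustering(cells):
    """Given (clustering, column, value) cells sorted by clustering key,
    emit the clustering key only when it differs from the previous
    cell's, instead of repeating it for every cell of the same row."""
    out, prev = [], object()  # sentinel: never equal to a real key
    for clustering, column, value in cells:
        out.append((clustering if clustering != prev else None, column, value))
        prev = clustering
    return out

# Two cells of the same CQL row, then one cell of the next row:
cells = [(("2014-07-01",), "temp", 21),
         (("2014-07-01",), "hum", 40),
         (("2014-07-02",), "temp", 23)]
assert elide_clustering(cells) == [(("2014-07-01",), "temp", 21),
                                   (None, "hum", 40),
                                   (("2014-07-02",), "temp", 23)]
```

With wide rows of many columns, this turns N copies of each clustering prefix into one, which is a large share of the compaction both proposals claim.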

          It also happens to allow us to guarantee delivery of high quality functionality in time for 3.0, even if it's possibly only a subset (though a good chance we can hit all of it), instead of delivering fairly average functionality that covers all users

          I do prefer delivering moderate improvements for all cases first and risking that the bigger but more restricted improvements slip, because this also means reducing the risk of having to maintain the old format longer (and, as said above, I do claim it's more digestible/incremental in practice). In fact, your comments so far have strongly suggested that having the file format cover everything was not even your second step.

          However even if the goal of eliminating the legacy format is decided upon, it seems best to leave this until much nearer GA as if we can deliver the full featureset before GA this is simply wasted effort.

          And to be clear, I respectfully disagree (that it's best; I also don't see how it's wasted effort). I ask that we collectively decide both on 1) whether or not eliminating the old format is a goal (I'm very, very strongly opposed to this not being a goal at all, personally) and 2) whether we should start by covering everything first and optimize later for some cases, versus starting by covering specific subsets first and adding the more general part second.

          benedict Benedict added a comment -

          It sounds like we're approaching agreement on our terms of disagreement, which is great.

          I'm just not sure what is intrinsically less incremental in what I'm suggesting.

          The relatively simple (but still time-consuming) solution you seem to be advocating sounds like one we would obsolete entirely once the full featureset I want to deliver is completed, since it is less efficient. It's quite possible it wouldn't even translate in the way you imagine, since the indexing is very different in the new scheme and so the current approach would not be possible (we need to implement some kind of tree structure for searching the clustering columns, at the very least, since we will not have any of the current KeyCache machinery, and limiting ourselves to byte-ordered types permits us to not only use but assume especially efficient tries that are both more compact on disk and algorithmically more efficient for merging on read). Either way, this obsolescence seems to largely get in the way of the real end-state we're aiming for, by introducing unnecessary work. I want incremental and non-superfluous deliverables en route.
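For readers following along, the trie idea alluded to here can be sketched in miniature: a byte-at-a-time trie gives point lookups in O(key length), independent of row count, and an in-order traversal yields keys in byte order, which is precisely why it presupposes byte-ordered clusterings. A toy in-memory version (nothing like the actual on-disk layout under discussion):

```python
class TrieNode:
    """One node per key byte; children keyed by the next byte value."""
    __slots__ = ("children", "value")

    def __init__(self):
        self.children = {}  # byte -> TrieNode
        self.value = None   # payload for keys ending here

def trie_insert(root, key: bytes, value):
    node = root
    for b in key:
        node = node.children.setdefault(b, TrieNode())
    node.value = value

def trie_lookup(root, key: bytes):
    """Walk one node per key byte; O(len(key)) regardless of how many
    keys are stored. Only meaningful if the keys' byte order equals
    their logical order, hence the byte-ordered-clustering requirement."""
    node = root
    for b in key:
        node = node.children.get(b)
        if node is None:
            return None
    return node.value

root = TrieNode()
trie_insert(root, b"apple", 1)
trie_insert(root, b"apricot", 2)
assert trie_lookup(root, b"apple") == 1
assert trie_lookup(root, b"banana") is None
```

Shared prefixes ("ap" above) are stored once, which hints at the on-disk compactness claim; merging sorted traversals of several such tries is likewise cheap, which hints at the merge-on-read claim.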

          As I stated above, I might be willing to aim for a row-oriented approach first (although incrementally delivered, so likely not supporting everything in one step, simply because this is easier and more digestible) which, once fully delivered, would support everything - except that this implementation I want to aim for would be for byte-ordered clustering columns only. If instead we want to support non-byte-ordered, we either have to do more work to deliver an optimal result (i.e. a complex hybrid btree/trie-variant), which is distinctly less incremental, or deliver a suboptimal result that only exists temporarily until optimised, and so is wasted effort. Either seems to run the risk of a reduction in the final featureset when done earlier in the process, simply by consuming precious time, so I prefer to implement these features much later, if at all (like I said, I strongly favour dropping support altogether), since at the last minute we can simply deliver an average/poor implementation if necessary. Especially given we know of no use cases outside of CQL types, to my knowledge, and we'll be eliminating those from the equation.

          In summary, we need to agree on your points (1) and (2) with my new comments, but also need (1a) and (2a):

          (1a): Do we want to try and eliminate this early in the process, or should we aim our end state for eliminating it and only fill in our suboptimal solution we'd deliver earlier if we are running out of time?
          (2a): Does 'everything' include non-byte-ordered types, or can we try to EOL them? Bearing in mind the cost if this needs to be aborted is not dramatic, but the payoff is potentially huge, simplifying many chunks of the low-level codebase (and permitting pretty substantial computational benefits).

          iamaleksey Aleksey Yeschenko added a comment -

          It was my understanding that CASSANDRA-7743 was a temporary solution, there only until we migrate all the data to the new format. And we absolutely do not want to keep two or more formats indefinitely - getting rid of the old format is one of the goals.

          As for the new format itself - yes, we do need it to support all the use cases (because we want to ultimately have just one format, with support for all the things).

          Question is how to deliver it incrementally while making sure that it will eventually (and soon enough) support all the things.

          I’d settle for listing all the features the initial format won’t support in the beginning, and then detailed solutions for each of those features - but at that point we might as well code it up. That, and I’m not sure we can guarantee we can even make such a list. There are many cases, obscure ones, that nevertheless must be supported - empty values in primitive types, mixed dynamic/static ‘thrift CFs’, non-byte-order-comparable (including custom) types. Having an upgrade path for all of those is a strict requirement that you can’t simply wish away (very unfortunately). It would be unacceptable to get close to 3.0 and only then realize that oops, there is a use case that we can’t implement with the new format, and we’ve gotta stick with keeping both formats forever.

          What would even be the logistics w/ a ‘feature’ not implemented by a new format? If the format is set to NEW in the yaml, and someone writes an empty BB to an int column - what do we do on flush? If someone adds a non-byteorder-comparable column, or a collection - at what stage do we reject that?

          That said, I think we are all in agreement that the current format can and should be replaced. And I agree that we should go for the big win here. I just think that the real big win is the one that would benefit all of our users (and “avoiding repeating clustering columns, using a compact encoding for CQL column names, and use offset-encoding for timestamps” benefits everyone - 1 and 3 even dense table users).

          This way we can deliver something strictly better than the old format to everyone, without the risk of getting stuck with two formats forever, and optimize for the special cases later.

          benedict Benedict added a comment -

          To clarify, the only feature I've suggested not supporting is non-byte-order comparable types (so, custom types are the only real point here, if they aren't byte-order comparable - but we could offer an upgrade path that requires users to provide a function that translates their non-byte-ordered variant to a byte-ordered variant). I don't want to support the old format to maintain support for these, I want to drop support altogether to simplify some of our tight loop code paths, guarantee efficient merging on read, and reduce implementation complexity for efficient operation in the new file format.

          Supporting empty values in int columns is something we'll definitely support out of the box. However, later down the line I would like to roll out data types that do NOT support empty values, so that we can optimise this away; it is not a huge issue either way, since the new formats will encode all of this information in a bitmap.
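
          To illustrate the bitmap remark, a toy sketch - the layout below (a little-endian presence bitmap followed by length-prefixed values) is invented purely for illustration, not the proposed encoding: an empty value costs one bit rather than a per-cell length prefix.

```python
def encode_row(values):
    """Encode one row of byte-string cells. Hypothetical layout:
    [presence bitmap][1-byte length + bytes for each non-empty cell]."""
    bitmap = 0
    payload = bytearray()
    for i, v in enumerate(values):
        if v:  # non-empty: set the bit and append the value
            bitmap |= 1 << i
            payload.append(len(v))
            payload += v
    nbytes = (len(values) + 7) // 8
    return bitmap.to_bytes(nbytes, "little") + bytes(payload)

def decode_row(data, ncols):
    """Invert encode_row; an unset bit means an empty value,
    which costs zero payload bytes."""
    nbytes = (ncols + 7) // 8
    bitmap = int.from_bytes(data[:nbytes], "little")
    pos = nbytes
    out = []
    for i in range(ncols):
        if bitmap & (1 << i):
            ln = data[pos]; pos += 1
            out.append(data[pos:pos + ln]); pos += ln
        else:
            out.append(b"")
    return out
```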

          benedict Benedict added a comment -

          Also, CASSANDRA-7743 is mostly unrelated to this I'm pretty sure

          iamaleksey Aleksey Yeschenko added a comment -

          Also, CASSANDRA-7743 is mostly unrelated to this I'm pretty sure

          Meant CASSANDRA-7443

          benedict Benedict added a comment -

          To clarify, further, my complete list of distinct featuresets we can deliver:

          • container format, which places relevant blocks of following components optimally using a complex strategy I will outline
          • partition index
            • hashed (phase 1)
            • byte-ordered (phase 2)
          • clustering column indexes
            • byte-ordered
            • [non-byte-ordered, basic]
            • [non-byte-ordered, btree]
            • [non-byte-ordered/byte-ordered hybrid]
          • value encoding
            • row oriented
            • column oriented
              • some extended variants on functionality of column oriented
            • row/column hybrid (actually possibly may retire this goal now; given latest approach it may not be sufficiently worth delivering)
            • efficient collection and complex UDT encoding/decoding (specifically, these will NOT inflate at the clustering column index level)

          All other features not mentioned are assumed to be supported, and if we've missed something we'll have to deal with it somehow. If there's some kind of nook and cranny I'm not currently aware of that will bite us later, please air it now for discussion, as it will at least help design.

          Anyway, we could deliver some kind of [non-byte-ordered,basic] implementation first time around which would deliver most functionality, just very suboptimally. If we want to deliver row-oriented first and want to support everything we need to also deliver efficient collection/UDT encoding/decoding.

          However, delivering a non-trie-based index means we inflate the amount of code dramatically, especially on read/merge/search, and introduce a great deal more conditional complexity, since we cannot assume byte-order comparability. I would strongly prefer to avoid this, given it yields little benefit and is avoidable at this crossroads.

          iamaleksey Aleksey Yeschenko added a comment -

          we could offer an upgrade path that requires users to provide a function that translates their non-byte-ordered variant to a byte-ordered variant

          I'm not sure if that counts as 'offer an upgrade path' - we can't guarantee that every type can be made byte-order-comparable. But forget custom types. How can you make, for example, a (frozen) set of blobs, on its own or inside a UDT, byte-order-comparable? And those are already a thing in 2.1.
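
          For reference, there is a standard order-preserving trick for concatenating variable-length byte strings - escape embedded zero bytes and terminate each component with 0x00 0x00 - though whether the extra encoding cost would be acceptable here is exactly the open question. A sketch:

```python
def escape_component(b):
    """Order-preserving encoding of one variable-length component:
    0x00 inside the value becomes 0x00 0x01, and 0x00 0x00 marks the
    end of the component. The terminator sorts below any escaped or
    literal continuation byte, so a prefix sorts before an extension."""
    return b.replace(b"\x00", b"\x00\x01") + b"\x00\x00"

def encode_tuple(parts):
    """Concatenation of escaped components: plain unsigned byte
    comparison of the result matches element-wise comparison of the
    original tuple of blobs."""
    return b"".join(escape_component(p) for p in parts)
```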

          slebresne Sylvain Lebresne added a comment -

          To clarify, the only feature I've suggested not supporting is non-byte-order comparable types

          And I reiterate my strong -1 on that at this point. It's also not just custom types imo: we allow UDTs in comparators that could include collections, and we will allow serialized collections more generally. And while we "might" be able to make those byte-order comparable at some point, I suspect it's not trivial at all. Also, to the best of my knowledge, we have no clear, well-defined and tested way to convert all CQL types to byte-order, and for some (decimal, varint, uuid (whose comparison depends on the uuid type)), it's far from trivial. So outside of the fact that, again, I'm -1 on dropping support for non-byte-order types altogether, it also seems to me that covering all but custom types for 3.0 is a fairly risky bet. I somewhat doubt that starting with a more general, if less efficient, indexing for clustering columns will be a big waste of effort.

          benedict Benedict added a comment - edited

          OK, you've convinced me. We certainly could make collections byte-order comparable, but you're right that it would be unpleasant to do so.

          On second thoughts, I'd say this isn't that unpleasant. Not ideal, but since it's not an efficient scenario anyway, incurring a slight serialization penalty would not be very costly. I'm a bit more on the fence now, but there's a great deal of long-term efficiency to be had if we can deliver only byte-order-comparable clustering columns, and a simplified code base.

          However using these as clustering columns is already inherently inefficient, so I propose we stick to delivering a relatively average-to-poor implementation which is dropped in automatically only when the types aren't all byte-order comparable.

          I'm pretty sure we can make decimal, varint and each uuid type byte-order comparable.
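
          For the fixed-width signed types, the conversion is the well-known sign-bit flip; decimal and varint are harder because they are variable-length, so this sketch covers only the easy case:

```python
import struct

def int64_to_comparable(v):
    """Big-endian two's complement with the sign bit flipped: unsigned
    lexicographic comparison of the result matches signed comparison of
    the input. The same trick covers int, bigint and timestamp widths."""
    b = struct.pack(">q", v)          # big-endian signed 64-bit
    return bytes([b[0] ^ 0x80]) + b[1:]  # flip the sign bit
```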

          benedict Benedict added a comment -

          Aleksey Yeschenko: CASSANDRA-7443 is unlikely to be removed once we obsolete the old file format. It only continues to help us in preparing for gen 4.0, which may be even more dramatic than 3.0

          slebresne Sylvain Lebresne added a comment -

          I propose we stick to delivering a relatively average-to-poor implementation which is dropped in automatically only when the types aren't all byte-order comparable

          I don't have a problem with that. I merely don't want to maintain 2 file formats, and so I want to prioritize having "some" indexing that doesn't assume byte-order. Happy to keep it relatively simple so we have time to add the byte-order index part too (I don't think adding a fallback general indexing will delay us much if we keep it simple (we can always optimize it later anyway if we want), and I'm happy to say that we'll hold 3.0 for the byte-ordered part if we're slightly late). The advantage I see being also that we'll feel a little bit less in a rush to deliver conversions to byte-order for all types (typically, maybe decimal will prove challenging and ugly, but it's also not all that often used in clustering columns, so it's totally ok if we don't have a conversion for 3.0).

          benedict Benedict added a comment -

          OK, after a bit more consideration I'm willing to accept that tack.

          We still need to get wider discussion about whether we want to prioritise column-oriented (which is likely to yield the largest performance implications) over row-oriented. The only major reason this might be beneficial is that it actually inherently supports fewer features, so we don't need to deliver all of the complexity out of the box (so we can deliver it faster). I don't have a strong position either way, however. Assuming we deliver row-oriented in time, we can likely deliver column-oriented quite quickly afterwards as well.

          jkrupan Jack Krupansky added a comment -

          Can you guys provide two very simple table definitions and corresponding CQL queries that exemplify row vs. columnar storage and processing optimality?

          IOW, the two key test cases that would confirm the extent to which the goals of this issue are met, although that might include a narrative about how much updating and deletions are impacting storage and performance as well.

          jkrupan Jack Krupansky added a comment -

          I'm also interested in the impact of the proposal on storage and performance on:

          1. Secondary indexes - either native Cassandra, manually maintained index tables, or even DSE/Solr indexing. Would it make them faster, more compact, less needed, or no net impact?

          2. Filtering - Would it enable more/faster filtering, discourage it, or no net change?

          benedict Benedict added a comment -

          Columnar storage is best for wide-tables storing wide-rows - i.e., tables with partitions that span multiple pages on disk and with many columns defined, which are not all accessed simultaneously (e.g. a table of various stock pricing, volume and related data). It is most helpful when this kind of data is accessed for analytics (esp., say, Spark performing token aware aggregations), but offers advantages for simple queries also, simply by reducing the number of pages that need to be read. The right kinds of data can also require substantially less disk space, as compression operating over a more uniform dataset (i.e. one column) is especially effective, as is delta encoding in some scenarios.

          jkrupan Jack Krupansky added a comment -

          Thanks, Benedict, so just to paraphrase: "columnar" is referring to the CQL use of "clustering columns" - multiple/many CQL rows per partition - and "row-oriented" is referring to a primary key consisting of only partition key columns with no clustering columns, so that "row-oriented" means only a single CQL row per partition, right?

          One clarification: does the delta encoding require that each CQL row have only one column, so that each adjacent "cell" is for the same CQL column, or... is delta encoding effective when the CQL row has a sequence of columns that map to a repeating sequence of adjacent cells, but the cells for a particular CQL column are never immediately adjacent in the partition?

          benedict Benedict added a comment -

          Jack Krupansky when we say columnar, we mean the general "columnar database" (not column family!) sense of the word. Either row or columnar-oriented can have clustering columns, but without clustering columns, columnar storage makes very little sense as each partition will contain only one row (and so there is no optimisation to be made for accessing multiple rows for only a subset of columns). It could still be sensible if the data were sufficiently similar across partitions that a high compression impact could be had, and we will likely support some hybrid of row/columnar, which might be especially applicable for such use cases, but in general you should be thinking timeseries data and the like for utilising columnar storage.

          Delta encoding (once delivered) won't have any such requirement, no. It will operate over a single column's adjacent values, i.e. the values from each adjacent row for a given column
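
          A toy illustration of that point - timestamps are the canonical case, since adjacent rows in a timeseries column tend to be numerically close, so the deltas stay small and compress well:

```python
def delta_encode(values):
    """Store the first value followed by successive differences
    between each adjacent pair in one column."""
    out, prev = [], 0
    for v in values:
        out.append(v - prev)
        prev = v
    return out

def delta_decode(deltas):
    """Invert delta_encode with a running sum."""
    out, acc = [], 0
    for d in deltas:
        acc += d
        out.append(acc)
    return out
```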

          benedict Benedict added a comment -

          Sylvain Lebresne, one thing I want to raise more explicitly (I alluded to, but want to make sure there's plenty of clarity of intent to open debate if necessary):

          In the new storage I want to move away from relying on overloading our comparators to achieve all of our functionality, and deliver slightly "special cased" implementations supporting these features, namely collections and static columns, and likely non-frozen UDT functionality as well. This will be built on top of primarily existing functionality, but the more explicit we are at managing these features at the storage level, the more efficient the end result is likely to be. In particular, collections will no longer live in the clustering area at all, and will be encoded entirely in the value area, but will be queryable within that. Static columns may be stored completely separately to support improved caching and faster/cheaper compaction.

          The upshot is that, in the medium term, this will hopefully feed back into the rest of Cassandra as well, although the time horizon for that completing may be lengthy (since memtables and cells are unlikely to be replaced in time for 3.0). Eventually I would like to see comparators return to simplicity.

          benedict Benedict added a comment -

          Attached is a first draft of a semi-formal proposal for achieving this and related goals.

          benedict Benedict added a comment -

          Reattaching with some blanks filled in

          velvia Evan Chan added a comment -

          Hey guys,

          I'm working on a project to implement a columnar database on top of Cassandra, so I found this really interesting. I have a repo comparing the regular CQL layout with a columnar layout, and I can confirm Benedict's comments – for certain queries I'm seeing over two orders of magnitude of speedup:

          https://github.com/velvia/cassandra-gdelt

          It would be super useful to have an option for a columnar layout, or to have that portion of the storage format be pluggable somehow.

          I'm splitting out my research into a zero-serialization, high performance vector binary format into a separate repo, drop me a note if you guys are interested.
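          [Editorial aside: a toy illustration, not from the repo above, of why a columnar layout wins on scan-style queries. Both method names and the data shapes are invented; the columnar version reads one contiguous primitive array with no per-field boxing or row-object indirection:]

```java
// Sketch of row-oriented vs column-oriented scans over the same data.
// Purely illustrative; unrelated to Cassandra's actual internals.
class ColumnarSketch {
    // Row-oriented: each row is an Object[] holding all of its fields,
    // so summing one field chases a pointer and unboxes per row.
    static long sumRowOriented(Object[][] rows, int fieldIndex) {
        long sum = 0;
        for (Object[] row : rows)
            sum += (Long) row[fieldIndex];
        return sum;
    }

    // Column-oriented: one primitive array per field; the scan is a
    // tight, cache-friendly loop over contiguous memory.
    static long sumColumnar(long[] column) {
        long sum = 0;
        for (long v : column)
            sum += v;
        return sum;
    }
}
```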

          jeromatron Jeremy Hanna added a comment -

          So just to clarify, is this approach going to make the CQL types more native to the storage format so that things like collections are more performant? I've seen instances where use of collections makes compaction throughput peg the CPU and bottleneck Cassandra. A more native approach would solve that, but that depends on how it's structured. If we're doing nestable collections, UDTs, and JSON in some form, does that negate the possibility of doing more native types, and hence leave more complicated structures less performant?

          benedict Benedict added a comment -

          If the dimensionality of the map is fixed, I don't see why not. Certainly much closer, anyway.

          michaelsembwever mck added a comment -

          Bumping to fix version 4.x, as 3.11.0 is a bug-fix only release.
            ref https://s.apache.org/EHBy

          kohlisankalp sankalp kohli added a comment -

          Anyone interested in working on this?

          velvia Evan Chan added a comment -

          Finally, someone is responding here.


            People

            • Assignee:
              kohlisankalp sankalp kohli
            • Reporter:
              benedict Benedict
            • Votes: 3
            • Watchers: 59