Cassandra / CASSANDRA-4175

Reduce memory, disk space, and cpu usage with a column name/id map

    Details

    • Type: Improvement
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Fix Version/s: 3.0
    • Component/s: None
    • Labels:

      Description

      We spend a lot of memory on column names, both transiently (during reads) and more permanently (in the row cache). Compression mitigates this on disk but not on the heap.

      The overhead is significant for typical small column values, e.g., ints.

      Even though we intern once we get to the memtable, this affects writes too via very high allocation rates in the young generation, hence more GC activity.

      Now that CQL3 gives us some guarantee that column names must be defined before they are inserted, we could create a map of (say) 32-bit int column ids to names, and use that internally right up until we return a resultset to the client.
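      Purely as an illustration (hypothetical class and method names, not an existing Cassandra API), such a bidirectional id/name registry might look like the sketch below, with id-to-name resolution being a single indexed lookup:

          import java.nio.ByteBuffer;
          import java.util.List;
          import java.util.Map;
          import java.util.concurrent.ConcurrentHashMap;
          import java.util.concurrent.CopyOnWriteArrayList;

          // Hypothetical sketch: assigns a 32-bit id to each column name the first
          // time it is seen, and resolves ids back to names with an indexed lookup.
          public final class ColumnNameRegistry
          {
              private final Map<ByteBuffer, Integer> nameToId = new ConcurrentHashMap<>();
              private final List<ByteBuffer> idToName = new CopyOnWriteArrayList<>();

              public synchronized int idFor(ByteBuffer name)
              {
                  Integer id = nameToId.get(name);
                  if (id != null)
                      return id;
                  int newId = idToName.size();
                  idToName.add(name);
                  nameToId.put(name, newId);
                  return newId;
              }

              // Single indexed lookup when converting back just before a resultset
              // is returned to the client.
              public ByteBuffer nameFor(int id)
              {
                  return idToName.get(id);
              }
          }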

        Issue Links

          Activity

          Benedict added a comment -

          See also CASSANDRA-6917 - IMO the best solution to this problem is an enum data type, and then to convert all column names to that type.

          Benedict added a comment -

          I think it could be a big win from a CPU pov just to have a transient (per launch, per node) map. On the assumption that we convert back via a single array lookup, the extra indirection cost is unlikely to be measurable, but if we were to precompute the comparisons of the ByteBuffer names we would definitely save O(name.length()) operations per task, and could potentially switch to counting sort and save O(m·n·lg n) [where n is the number of columns involved in an operation, and m is the length of the column names] for CFs with, say, < 100 columns.

          It could potentially be implemented by abstracting Column to allow different sources of name(), so that CFs with large numbers of column names, or TimeUUID comparators, etc. can remain with the current implementation. Obviously with care taken not to break the native protocol...
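          To illustrate the sorting argument with a hedged sketch (a hypothetical method, not Cassandra code): once names have been replaced by dense int ids assigned in comparator order, cells can be ordered with a counting sort over the ids instead of O(name.length()) ByteBuffer comparisons, which is where the O(m·n·lg n) saving would come from for CFs with a small, fixed set of columns.

              // Hedged sketch: counting sort over column ids that were assigned in
              // comparator order; assumes a small id universe (e.g. < 100 columns).
              static int[] sortByColumnId(int[] cellIds, int idUniverseSize)
              {
                  int[] counts = new int[idUniverseSize];
                  for (int id : cellIds)
                      counts[id]++;
                  int[] sorted = new int[cellIds.length];
                  int out = 0;
                  for (int id = 0; id < idUniverseSize; id++)
                      for (int i = 0; i < counts[id]; i++)
                          sorted[out++] = id;
                  return sorted;
              }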

          Sylvain Lebresne added a comment -

          I'm pretty sure we'd need CASSANDRA-5417 to make that doable (in fact, that's one of my original motivations for doing CASSANDRA-5417). Namely, we don't want a cell name/id map, we want a cql3 column/id map, otherwise this loses most of its interest. And we can't do a cql3 column/id map if we store cell names as opaque byte buffers.

          To be more precise, I don't deny that a cell name/id map could be a start and would in fact serve some use cases, but I'm a bit reluctant to implement that knowing that we want to change to a cql3 column/id map sooner rather than later, because I suspect it'll be a lot easier to do "the right thing" to start with rather than doing a cell name/id map and then having a painful time switching to a cql3 column/id one without breaking backward compatibility.

          Besides, I also suspect there are a bunch of refactorings in CASSANDRA-5417 that would be needed here as well, so working on both separately without coordination is likely to be frustrating and a duplication of effort.

          Anyway, I do plan on getting back to CASSANDRA-5417 asap (though it is unlikely to be next week), so maybe we can hold off a bit on this one until then? If I've made no progress on CASSANDRA-5417 in, say, a month or two, and people really want this, we can re-evaluate.

          Jonathan Ellis added a comment -

          Has it become easier to get to know sstable version numbers in the serializer class now?

          I could maybe check if someone in the team here would like to take a stab at moving this to latest cassandra and commit it if the above implementation seems interesting.

          That would be great. Yes, you'll see Descriptor.Version being passed around now which is what encapsulates what kind of sstable it is, including to the lowest level of Column.onDiskIterator.

          Terje Marthinussen added a comment -

          I should maybe add that 1 and 2 above do not exclude but rather complement each other.

          #1 is a manual map and could allow things like a prefix map such as '$201212', which would map all such prefixes to an id.

          #2 is an auto map. It may require #1 if we want to allow users to give "hints" for substring maps such as '$(201\d\d\d)', mapping all year+month-like strings starting with 201 to a mapping entry. This would just be a hint; sampling of the number of entries should decide what actually gets mapped, to avoid running out of memory.

          I am a bit unsure whether advanced features like substrings would ever be used; maybe they should only be implemented separately as some sort of substring detection.

          As this can be a bit processing intensive, substring statistics (top substrings) could be detected and maintained node-wide during compaction and given as hints to the serializer later.

          Terje Marthinussen added a comment - edited

          Hi,

          Sorry for the late update.

          Yes, we have a cluster with some 20-30 billion columns (maybe even closer to 40 billion by now) which implements a column name map and has been in production for about 2 years.

          I was actually looking at committing this 2 years ago, together with a fairly large number of other changes implemented in the column/supercolumn serializer code, but I never got around to implementing a good way to push the sstable version numbers into the serializer to keep things backwards compatible before focus moved elsewhere.

          As mentioned by others above, while not benchmarked and proven, I had a very good feeling the overall change helped quite a bit with GC issues and memtables, and a bit with performance in general; in terms of disk space, though, the benefit was somewhat limited once sstable compression was implemented, as the repeating column names compress pretty well.

          This was already 2 years ago (the cluster still runs, by the way), but if memory serves me right:
          30-40% reduction in disk space without compression
          10% reduction on top of compression (I did a test after it was implemented).

          In my case, the implementation is actually hardcoded due to time constraints: a static map which is global for the entire cassandra installation.

          If committing this into cassandra, I believe my plan was split in 3, possibly as 3 different implementation stages:

          1. A simple config option (as a config file or as a columnfamily) where users themselves can assign repeating column names. Sure, it is not as fancy as many other options, but maybe we could open it up to cover some strange corner-case usages here, with things like substrings as well.

          Think options to cover complex versions of patterns like date/times such as 20130701202020 where a large chunk of the column name repeats, but not all of it.

          In the current implementation, if there is a mapping entry, it converts the string to a variable length integer which becomes the new column name. If there is no mapping entry, it stores the raw data.

          In our case, we have <40 repeating column names so I never need more than a 1 byte varint.

          I also modified the column format to add a "column feature bitmap" at the start of each column. This allowed me to turn on/off name/id mapping as well as things like TTLs and a handful of other metadata fields.

          There are a bunch of 64-bit numbers in the column format which only have the default value in 99.999% of all cases, and very often your column value is just an 8-byte int, a boolean or a short text entry. That is, in most cases the column metadata is many times larger than the value stored.

          This would have been my first implementation, mostly because I have a working implementation of it already, and the mapping table would be very easy to move to a config file with just a list of column names read at cassandra startup, or stored in a similar way to column family and other internal config (just as another keyspace for config). Unfortunately, it is also a little bit of work to push such config data down to the serializer, at least as the code was organized 2 years ago.

          Notice again, you do not need any sort of atomic handling of updates to the map in this implementation. You can add map entries at any time. The result after deserializing is always the same, as column names can have a mix of raw and mapped id values thanks to the "column feature bitmap" that was introduced.

          Entries that were stored as raw strings will eventually be replaced by IDs into the map as compaction cleans things up.
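          To make the layout described above concrete, here is a hedged, illustrative sketch (flag value, field order and types are invented for the example and are not the actual patch's on-disk format): a per-column feature byte decides whether the name field holds a small mapped id or the raw name bytes.

              import java.io.DataOutput;
              import java.io.IOException;
              import java.nio.charset.StandardCharsets;
              import java.util.Map;

              // Illustrative only: one "column feature bitmap" byte per column decides
              // whether the name is a small mapped id or the raw name bytes.
              final class NameSerializer
              {
                  private static final int FLAG_MAPPED_NAME = 0x01; // invented flag value

                  private final Map<String, Integer> nameToId;      // e.g. fewer than 40 entries

                  NameSerializer(Map<String, Integer> nameToId)
                  {
                      this.nameToId = nameToId;
                  }

                  void serializeName(String name, DataOutput out) throws IOException
                  {
                      Integer id = nameToId.get(name);
                      if (id != null && id < 128)
                      {
                          out.writeByte(FLAG_MAPPED_NAME);
                          out.writeByte(id);                         // a 1-byte id is enough here
                      }
                      else
                      {
                          byte[] raw = name.getBytes(StandardCharsets.UTF_8);
                          out.writeByte(0);                          // no features: raw name follows
                          out.writeShort(raw.length);                // the usual 2-byte length prefix
                          out.write(raw);
                      }
                  }
              }

          Because the feature byte travels with every column, mapped and raw names can coexist in the same sstable, which is what makes the non-atomic map updates described above safe.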

          2. Auto learning feature with mapping table per sstable.
          This would be stage 2 of the implementation.

          When starting to create a new SSTable, build a sampling of the most frequently occurring column names and gradually start mapping them to IDs.

          Add the mapping table to the end of the SSTable or in a separate .map file (similar to index files) at the completion of sstable generation.

          The initial id mapping could be further improved by maintaining a global map of column names. This "global map" would not be used for serialization/deserialization. It would be used to pre-populate the map for a new sstable and would only be statistics to optimize things further by reducing the number of mapping variations between sstables and reducing the number of raw values getting stored a bit more.

          The id map would still be local to each sstable in terms of storage, but having such statistics would allow you to dramatically reduce the size of a potentially shared id cache across sstables where a lot of mapping entries would be identical.

          Some may feel that we would run out of memory quickly or use a lot of extra disk with maps per sstable, but I guess that we only really need to deal with the top few thousand entries in each sstable, and this would not be a problem to keep in an idmap cache in terms of size.

          This is really just the top X recurring column names or column name sub-patterns.

          If you have more unique column entries than this in an sstable, this will probably not be the feature that saves the day anyway, as the benefit per column entry will be quite small vs. the overhead; the entire feature should potentially disable itself automagically if there are no frequently repeating patterns.
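          A hedged sketch of the sampling side of such an auto map (class name, thresholds and the .map file handling are purely illustrative): count name frequencies while an sstable is written, then keep only the top recurring names and assign them dense ids.

              import java.nio.ByteBuffer;
              import java.util.Comparator;
              import java.util.HashMap;
              import java.util.LinkedHashMap;
              import java.util.Map;

              // Illustrative only: tracks column name frequencies during an sstable write
              // and assigns dense ids to the top recurring names for a companion .map file.
              final class SSTableNameSampler
              {
                  private final Map<ByteBuffer, Long> counts = new HashMap<>();

                  void sample(ByteBuffer name)
                  {
                      counts.merge(name, 1L, Long::sum);
                  }

                  Map<ByteBuffer, Integer> buildIdMap(int maxEntries, long minOccurrences)
                  {
                      Map<ByteBuffer, Integer> idMap = new LinkedHashMap<>();
                      counts.entrySet().stream()
                            .filter(e -> e.getValue() >= minOccurrences)
                            .sorted(Map.Entry.<ByteBuffer, Long>comparingByValue(Comparator.reverseOrder()))
                            .limit(maxEntries)
                            .forEach(e -> idMap.put(e.getKey(), idMap.size()));
                      return idMap;
                  }
              }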

          3. I had some ideas for moving the mapping up from the serializer to allow things like streaming entries including id maps between nodes, but things do indeed quickly get ugly and I do not remember clearly how I had planned to do this.


          The reason I isolated the mapping function to the serializer is that it looked incredibly messy to move this further "up" in the stack. Column sorts, range scans, lookups...

          Not fun at all, and if the memtable is serialized anyway, the memory consumption there and in the disk cache is dramatically reduced.

          Also... with a global static map here at startup time, I actually share the mapped strings across most columns in memory anyway, as I believe they all become pointers to my static compiled-in map (again, this gets a lot more trivial to make work well if this is a startup config, though yes, it is a bit less user friendly).

          I haven't looked at the cassandra code for way too long now.

          Has it become easier to get to know sstable version numbers in the serializer class now?

          I could maybe check if someone in the team here would like to take a stab at moving this to latest cassandra and commit it if the above implementation seems interesting.

          Part of it should be really easy to port as long as we can get a bit more info into the serializer/deserializer.

          Edward Capriolo added a comment - edited

          2995 says

          It could be advantageous for Cassandra to make the storage engine pluggable. This could allow Cassandra to

          deal with potential use cases where maybe the current sstables are not the best fit
          allow several types of internal storage formats (at the same time) optimized for different data types

          Since this issue talks about reducing disk space, it will be changing how data is written; this seems to benefit people with mostly static columns. It sounds right on the money with 2995. However, it goes beyond storage layer changes.

          The feature makes a ton of sense and does not only benefit the cql3 case. Many people have static columns and since 0.7 standard column families have had schema as well.

          If cassandra had a 'pluggable storage format', one of the things the 'ColumnMapIdStorageFormat' could do is write the known schema to a small file loaded in memory with each sstable (like the bloom filter) that would contain the mappings. In the end I think you would have to store this anyway, because the mappings would change over time and what is in the schema now may not be fully accurate for old flushed sstables. This would only save storage as mentioned, and the internode traffic could not be optimized with pluggable storage alone.

          For compare and swap, well whatever, it's just one feature and no one has to use it if they do not want to. However, requiring all schema changes to go through zk is crazy scary to me. It is true that schema has always needed to propagate before it can be used. I personally do not want to have to install zk side by side with all my cassandra installs, and I do not want to rely on it for schema changes.

          Architecturally, building on zk is a house of cards. This was originally why I chose cassandra over hbase (hbase had metadata on hdfs, and state information in zk). The WORST thing that ever happens to cassandra is a node with a corrupt schema or a schema disagreement; I restart or decommission/rejoin the node and it is fixed.

          If we start storing bits of information (column ids, schema) in zookeeper, we become totally reliant on it: nodes may or may not be able to start up without it, we may or may not be able to make schema changes without it, and MOST IMPORTANTLY, IT IS A SPOF THAT, WHEN IT GOES CORRUPT, will likely cause the entire cluster to die, or to function in a way worse than death, something like writing corrupt column ids to files and hopelessly corrupting everything.

          No thanks to any ZK integration. ZK and centrally managed metadata = hbase.

          Jonathan Ellis added a comment -

          That is not what we are talking about.

          Edward Capriolo added a comment -

          It also sounds like we are re-opening the concept of pluggable storage (https://issues.apache.org/jira/browse/CASSANDRA-2995), since we are talking about custom disk formats only good for specific use cases.

          Jonathan Ellis added a comment -

          Edited title to reflect the obvious.

          Nate McCall added a comment -

          punt: let each node use a node-local map, and translate back and forth to full column name across node boundaries

          I would much prefer this approach. Particularly since I know of at least one large-ish cluster that had this working back in 0.8 after hacking it in by hand (Terje Marthinussen, if you folks are still around, I would like to know your thoughts on this issue).

          This also feels like a better bounding of the task since the goal is to reduce local memory consumption and GC activity.

          Another thought: why not make this a function of the comparator for type-specific encode/decode? This would make common encodes/decodes of some types extremely efficient while using the 'int counter' approach mentioned above for other types.

          Either way, a node-local approach initially would not preclude the use of CAS in the future (hopefully, the encapsulation provided by such a 2-step approach would facilitate making the CAS part optional for those of us who have bet the farm on dynamic column names).

          Edward Capriolo added a comment -

          https://issues.apache.org/jira/browse/CASSANDRA-44
          https://issues.apache.org/jira/browse/CASSANDRA-45

          If we are going to use zookeeper, why not do what was suggested in CASSANDRA-44 and move all the schema to zookeeper? Then there are no schema consistency issues at all.

          We can continue to add stuff to zookeeper until cassandra becomes a poor man's hbase. CAS, atomic counters, row locks, let's do it!

          Can someone point me to some real world examples of how large the average column name is and how much this optimization will help? I am not sure I follow how this helps.

          I am looking at http://thelastpickle.com/2013/01/11/primary-keys-in-cql/

          RowKey: 3:201302
          => (column=2013-02-20 10\:58\:45+1300:, value=, timestamp=1357869161380000)
          => (column=2013-02-20 10\:58\:45+1300:is_dam_dirty_apes, value=01, timestamp=1357869161380000)
          => (column=2013-02-20 10\:58\:45+1300:pressure, value=00001ed2, timestamp=1357869161380000)
          => (column=2013-02-20 10\:58\:45+1300:temperature, value=0000001f, timestamp=1357869161380000)

          In this example the column names are '2013-02-20 10\:58\:45+1300', '2013-02-20 10\:58\:45+1300:is_dam_dirty_apes', '2013-02-20 10\:58\:45+1300:pressure' and '2013-02-20 10\:58\:45+1300:temperature'.

          How are we going to build caches of this? We must also be thinking of some new format, not sstables?

          Jonathan Ellis added a comment -

          If you're going to respond to a comment over a year old, you should at least read the newer ones. There's no reason to use ZK now that we have CAS available.

          Edward Capriolo added a comment -

          Schema has never really been eventually consistent since you can't do anything useful until the schema has propagated. Fortunately, since schema changes are rare, this isn't a problem in practice.

          I do not see how it will not be a problem in practice. In a single datacenter deployment it means you will not be able to add schema if zookeeper is not up.

          In a multi-datacenter deployment I do not know what it means; it depends on how you want to interpret multi-datacenter zookeeper. Do all datacenters need to be reachable for schema changes? Then it is not AP.
          http://zookeeper-user.578899.n2.nabble.com/Managing-multi-site-clusters-with-Zookeeper-td4685686.html

          Nope, since it's 2 bytes for name length, then 2 for the name. To win vs 32bit int you'd have to have a single-letter name. (And of course any use of CompositeType blows this right out.)

          Seriously. Why aren't people using 1 or 2 character column names? That would give you something like 676 combinations, and then in the schema we could just store a comment like 'pw means password'. Problem solved.

          Jonathan Ellis added a comment -

          Trying to generate atomic cross node auto_ids or using zookeeper for coordination of this seems to go against the entire eventual consistency model of cassandra.

          Schema has never really been eventually consistent since you can't do anything useful until the schema has propagated. Fortunately, since schema changes are rare, this isn't a problem in practice.

          Why not have a node-local map and use murmur hash.

          As mentioned above, this is a reasonable approach, but it does mean we have to convert back-and-forth when talking to another node. (Which actually means you lose CPU vs the current approach, since you're serializing the same data but now you have to go through an extra layer of indirection to do so.)

          Also silly question. But if your columns are named 'pw' instead of 'password' aren't you more optimized than a 32-bit or 64-bit integer anyway?

          Nope, since it's 2 bytes for name length, then 2 for the name. To win vs 32bit int you'd have to have a single-letter name. (And of course any use of CompositeType blows this right out.)
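          To spell out the arithmetic behind that: with a 2-byte length prefix plus the name bytes, 'pw' serializes to 2 + 2 = 4 bytes, exactly the size of a 32-bit int id, and 'password' to 2 + 8 = 10 bytes; only a single-character name (2 + 1 = 3 bytes) comes in under the 4-byte id.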

          Edward Capriolo added a comment -

          Trying to generate atomic cross node auto_ids or using zookeeper for coordination of this seems to go against the entire eventual consistency model of cassandra.

          Why not have a node-local map and use murmur hash.

          Also silly question. But if your columns are named 'pw' instead of 'password' aren't you more optimized than a 32-bit or 64-bit integer anyway?

          Jonathan Ellis added a comment -

          Much hand-waving ahead:

          1. if schema updates went over the regular migration path instead of being special case voodoo,
          2. and if we fired the schema change mechanism with a trigger,
          3. then we could use CAS to make sure everyone agrees on column name IDs
          Jonathan Ellis added a comment - edited

          identityHashCode is basically the object's location in memory, so it's not going to be the same on different nodes. (So it would work for approach 2, I suppose, but I'd rather use a simple int counter.)

          Dave Brosius added a comment -

          how about System.identityHashCode(string) ?

          Jonathan Ellis added a comment -

          And extremely collision-prone.
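          For illustration (a standalone snippet, not from the ticket): Java's String.hashCode collides even on two-character strings, so two distinct column names could silently map to the same id.

              public class HashCollisionDemo
              {
                  public static void main(String[] args)
                  {
                      // "Aa" and "BB" both hash to 2112 under String.hashCode.
                      System.out.println("Aa".hashCode()); // 2112
                      System.out.println("BB".hashCode()); // 2112
                  }
              }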

          T Jake Luciani added a comment -

          Can't you use String.hashCode? It's portable.

          Jonathan Ellis added a comment -

          The wrinkle here is concurrent schema changes – how can we make sure each node uses the same column ids for each name? I see two possible approaches:

          1. embed something like Zookeeper to standardize the id map
          2. punt: let each node use a node-local map, and translate back and forth to full column name across node boundaries

            People

            • Assignee: Jason Brown
            • Reporter: Jonathan Ellis
            • Votes: 9
            • Watchers: 34
