Details

    • Type: New Feature
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Fix Version/s: 1.0.0
    • Component/s: Core
    • Labels:

      Description

      We should be able to do SSTable compression which would trade CPU for I/O (almost always a good trade).

      1. snappy-java-1.0.3-rc4.jar
        972 kB
        Pavel Yaskevich
      2. CASSANDRA-47-v5.patch
        69 kB
        Sylvain Lebresne
      3. CASSANDRA-47-v4-fixes.patch
        16 kB
        Pavel Yaskevich
      4. CASSANDRA-47-v4.patch
        51 kB
        Pavel Yaskevich
      5. CASSANDRA-47-v3-rebased.patch
        41 kB
        Pavel Yaskevich
      6. CASSANDRA-47-v3.patch
        50 kB
        Pavel Yaskevich
      7. CASSANDRA-47-v2.patch
        80 kB
        Pavel Yaskevich
      8. CASSANDRA-47.patch
        70 kB
        Pavel Yaskevich

        Issue Links

          Activity

          Chris Goffinet added a comment -

          Are you wanting compression just for the data file for now? Do you think LZO might be an interesting codec for this?

          Johan Oskarsson added a comment -

          While LZO is great, it is unfortunately not compatible with the Apache license. They had to remove it from Hadoop, see HADOOP-4874. In that ticket they mention a couple of alternatives such as FastLZ (http://www.fastlz.org/) or LZF (http://h2database.googlecode.com/svn/trunk/h2/src/main/org/h2/compress/).

          It would be possible to do what Hadoop has done and release the LZO bindings separately, but that would make it harder for users to find and use.

          Jonathan Ellis added a comment -

          A very straightforward way to compress would be on a per-[column]index-block basis. for large rows this should work well, and requires only minor changes to sstable reading/writing. compression across multiple rows would work better for small rows, but requires more invasive changes, and adds a lot of complexity for large rows (do you break the row across multiple blocks? or decompress the entire row into memory every time? both options suck).

          Jonathan Ellis added a comment -

          Voldemort uses LZF for speed-optimized compression: http://code.google.com/p/project-voldemort/issues/detail?id=172

          Hadoop may have FastLZ support soon: HADOOP-6349

          Tatu Saloranta added a comment -

          I just submitted a slightly rewritten version of LZF to Voldemort, optimized for a specific use case (straight byte[] -> byte[], no streaming). If that's the model that'd be used here (i.e. input size is fully known, no need for streaming) it'd work pretty well (adding streaming is not difficult, it just needs a bit more code to work well with all chunk sizes).

          From my testing LZF seems like a very promising candidate for these use cases – very fast, overhead generally no larger than parsing overhead for stored content – and the code is very simple, so it is a good algorithm to start with.
          If more compression is needed, other algorithms (like basic deflate (== gzip)) could be plugged in later on.

          Anyway, the main problem obviously is not the codec, but if you need any help with the LZF codec, just let me know (or others on the voldemort list).
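
          For illustration, here is a minimal sketch of the straight byte[] -> byte[] (no streaming) model described above, expressed as a hypothetical pluggable interface; the names are illustrative and do not come from any attached patch:

          // Hypothetical pluggable block codec: input size is fully known up front, no streaming.
          public interface BlockCompressor
          {
              /** Compresses the whole input buffer; the result carries everything needed to decompress it. */
              byte[] compress(byte[] uncompressed) throws java.io.IOException;

              /** Restores the original bytes from a buffer produced by compress(). */
              byte[] decompress(byte[] compressed) throws java.io.IOException;
          }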

          Jonathan Ellis added a comment -

          Thanks, Tatu!

          Jonathan Ellis added a comment -

          Let's push data format change to 0.8. I'm burned out on it, and nobody else is stepping up to help review.

          We can fix CASSANDRA-16 with the current data format for 0.7.

          Kazuki Ohta added a comment -

          Just a comment. SSTable compression is very useful for storing large web pages. By using an order-preserving hash, we can store the web pages of the same domain, maybe in the same SSTable.

          At this time, the vcdiff algorithm (Bentley-McIlroy 99 scheme) can effectively compress the longest common strings. Currently, many web pages are constructed using the same templates, so this algorithm is able to eliminate the template part and keep only the content part. I've blogged about this algorithm: http://kzk9.net/b/2010/02/vcdiff-data-compression-using-long-common-strings/

          I think this will open up huge opportunities for Cassandra. Even in a single block, this will work fine. If the compression becomes pluggable, I want to implement this algorithm part.

          Holden Robbins added a comment -

          Feel free to tell me I'm off-base here, but what about doing something super simple like storing the segment compressed and un-compressing it when it's accessed on disk? The compaction process could possibly clean up uncompressed segments. I'm thinking this would solve my particular use case (log data) well, since our requirement is to store a large amount of data but the majority of reads will only be on a small subset of recently inserted data.

          If it sounds like a decent approach I'll be happy to put together a patch.

          Jonathan Ellis added a comment -

          Not sure what you mean by segment. The sstable files are too large to de/compress as a unit. Large rows make sense to compress as a unit (at the index block level most likely); the question is what to do about smaller rows.

          Rustam Aliyev added a comment -

          Just wanted to share this benchmark from QuickLZ: http://www.quicklz.com/bench.html

          LZF performance is quite impressive in terms of both compress/decompress speed and compression ratio.

          There's another LZF implementation available under Apache 2 license: https://github.com/ning/compress
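
          For reference, a round trip with that library might look roughly like the sketch below (assuming the LZFEncoder/LZFDecoder entry points; exact signatures may vary between versions):

          import java.io.IOException;
          import java.util.Arrays;

          import com.ning.compress.lzf.LZFDecoder;
          import com.ning.compress.lzf.LZFEncoder;

          public class LzfRoundTrip
          {
              public static void main(String[] args) throws IOException
              {
                  byte[] original = "some serialized row or column data".getBytes("UTF-8");

                  // Straight byte[] -> byte[] compression, no streaming involved.
                  byte[] compressed = LZFEncoder.encode(original);
                  byte[] restored = LZFDecoder.decode(compressed);

                  System.out.println(Arrays.equals(original, restored)
                          + ": " + original.length + " -> " + compressed.length + " bytes");
              }
          }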

          T Jake Luciani added a comment -

          I've been checking out the ning compress module. It seems like it would be the best for us given its Apache-licensed LZF, and the code is small. Rather than making it a dependency, I think we would be best served to ingest it into the project...

          This relates to CASSANDRA-674 and CASSANDRA-1472.

          Jonathan Ellis added a comment -

          We can do this (and CASSANDRA-1717) without a full-on rewrite of the sstable layer.

          As I said 18 months ago:

          A very straightforward way to compress would be on a per-[column]index-block basis. for large rows this should work well

          The trickier part is compressing small rows as well, efficiently. We can do this by adding rows into blocks (using the same tunable parameter for block size) and changing the index entry to a 3-tuple: key, compressed block offset from start of file, offset of row from start of block once uncompressed.

          Either kind of block should end with a checksum to handle CASSANDRA-1717.

          So we would have two types of compression, rows-in-blocks and blocks-in-rows. We decide at write time which to use based on our sstable row size statistics: if average row size is less than 10% of the index block size, and max row size is strictly less than that size, then we use rows-in-blocks; otherwise, we use blocks-in-rows. So we bias towards blocks-in-rows to avoid uncompressing data we don't need so much.
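
          A small sketch of that write-time decision and the proposed 3-tuple index entry (illustrative names only, not code from a patch):

          // Illustrative only: layout choice from sstable row-size statistics, plus the 3-tuple index entry.
          final class CompressionLayout
          {
              // (key, compressed block offset from start of file, offset of row from start of uncompressed block)
              static final class IndexEntry
              {
                  final byte[] key;
                  final long compressedBlockOffset;
                  final long offsetInUncompressedBlock;

                  IndexEntry(byte[] key, long compressedBlockOffset, long offsetInUncompressedBlock)
                  {
                      this.key = key;
                      this.compressedBlockOffset = compressedBlockOffset;
                      this.offsetInUncompressedBlock = offsetInUncompressedBlock;
                  }
              }

              /** Bias towards blocks-in-rows so we avoid uncompressing data we don't need. */
              static boolean useRowsInBlocks(long avgRowSize, long maxRowSize, long indexBlockSize)
              {
                  return avgRowSize < indexBlockSize / 10 && maxRowSize < indexBlockSize;
              }
          }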

          Brandon Williams added a comment - edited

          I think this idea hits the sweet spot where we currently stand. Compression is a huge win for us, and not having to rewrite the entire format greatly reduces the complexity.

          Terje Marthinussen added a comment -

          This is not so interesting for a "proper" solution maybe, but adding just for the reference.

          I needed to get space for more data, so I recently just crashed into a quick compression hack for supercolumns.

          I was considering to compress the index blocks as Jonathan suggested, but I could not make up my mind on how safe that would be in terms of other code accessing the data and had a bit short time, so I looked for something more isolated.

          Final decision was to simply compress the serialized columns in a supercolumn (+ add a bit of caching to avoid recompressing all the time when the serialized size is requested).

          The data I have has supercolumns with typically 50-60 subcolumns, mostly small strings or numbers.
          In total, the subcolumns make up 600-1200 bytes of data when serialized.

          Usually a fair bit of supercolumns per row.

          My test data was 447 keys. I tested with the ning lzf jars and the default java.util.zip.
          This is not necessarily a good test, but I think json2sstable is somewhat useful to measure relative impact between implementations, although not useful to determine real performance in any way.

          In addition, I made a simple dictionary of column names (only applied to supercolumns) as the column names were not very well compressed when looking at just a single supercolumn at a time.

          The result of both the digest and compression:
          Standard cassandra. json2sstable:
          real 0m55.148s
          user 1m50.023s
          sys 0m2.856s
          sstable: 190MB

          ning.com:
          real 1m8.315s
          user 2m18.361s
          sys 0m4.600s
          sstable: 108MB

          java.util.zip
          real 1m35.899s
          user 2m49.691s
          sys 0m2.940s
          sstable: 90MB

          As a reference, the whole sstable files compresses as follows:
          ning.com (command line)
          real 0m1.803s
          user 0m1.536s
          sys 0m0.396s
          sstable: 80MB

          gzip (command line)
          real 0m6.175s
          user 0m6.076s
          sys 0m0.084s
          sstable: 48MB

          I doubt this implementation has much for inclusion in a release. Just added the numbers for the reference.
          Of course, if requested, I could see if I could make the patch available somewhere.

          Terje Marthinussen added a comment - edited

          Just curious if any active work is being done or planned in the near future on compressing larger data blocks, or is it all suspended waiting for a new sstable design?

          Having played with compression of just supercolumns for a while, I am a bit tempted to test out compression of larger blocks of data. At least row level compression seems reasonably easy to do.

          Some experiences so far which may be useful:

          • Compression on sstables may actually be helpful for memory pressure, but with my current implementation, non-batched update throughput may drop 50%. I am not 100% sure why, actually.
          • Flushing of (compressed) memtables and compactions are clear potential bottlenecks.
            The obvious trouble maker here is the fact that you keep

          For really high pressure work, I think it would be useful to only compress tables once they pass a certain size, to reduce the amount of recompression occurring on memtable flushes and when compacting small sstables (which is generally not a big disk problem anyway).

          This is a bit awkward when doing things like I do in the supercolumns, as I believe the supercolumn does not know anything about the data it is part of (except for recently, the deserializer has that info through "inner").

          It would anyway probably be cleaner to let the data structures/methods using the SC decide when to compress and not

          • Working on a SC level, there seems to be some 10-15% extra compression on this specific data if column names that are highly repetitive in SCs can be extracted into some metadata structure so you only store references to these in the column names. That is, the final data goes from about 40% compression to 50% compression.

          I don't think the effect of this will be equally big with larger blocks, but I suspect there should be some effect.

          • Total size reduction of the sstables when using a dictionary for column names as well as timestamps and variable-length length fields is currently in the 60-65% range. It is however mainly beneficial for those that have supercolumns with at least a handful of columns (400-600 bytes of serialized column data per SC at least).
          • Reducing the metadata on columns by building a dictionary of timestamps as well as variable-length name/value length data (instead of fixed short/int) cuts down another 10% in my test (I have just done a very quick simulation of this with a very quick "10 minute" hack on the serializer).
          • We may want to look at how we can reuse whole compressed rows on compactions if for instance the other tables you compact with do not have the same data
          • We may want a new cache on the uncompressed disk chunks. In my case, I preserve the compressed part of the supercolumn and

          In my supercolumn compression case, I have a cache for the compressed data so I can write that back without recompression if not modified. This also makes calls to get the serialized size cheaper (don't need to compress both to find serialized size and to actually serialize)

          If people are interested in adding any of the above to current cassandra, I will try to get time to make some of this up to a quality where it could be used by the general public.

          If not, I will wait for new sstables to get a bit more ready and see if I can contribute there instead.

          Terje Marthinussen added a comment -

          And yes, traversing the row twice to calculate the serialized size before writing it is reasonably expensive with compression.

          Any good reason we absolutely have to do this in two passes instead of one?
          Or is it just the way it is for historical purposes?

          Yes, we would need to write indexes etc. after the row or in a separate file if we don't traverse it twice, but this would seem a relatively small change at first sight?

          Ryan King added a comment -

          Stu is working on https://issues.apache.org/jira/browse/CASSANDRA-674 which will improve the file size dramatically.

          Ryan King added a comment -

          I think this is going to be obsoleted by CASSANDRA-674.

          Pavel Yaskevich added a comment - edited

          Patch introduces CompressedDataFile with Input/Output classes. Snappy is used for compression/decompression because it showed better speeds in tests compared to ning. Files are split into 4 bytes + 64kb chunks, where the 4 bytes hold the compressed chunk size. Note that the current SSTable file format is preserved and no modifications were made to the index, statistics or filter components. Both Input and Output classes extend RandomAccessFile so random I/O works as expected.

          All SSTable files are opened using CompressedDataFile.Input. On startup, when SSTableReader.open gets called, it first checks if the data file is already compressed and compresses it if it was not, so users won't have a problem after they update.

          At the header of the file it reserves 8 bytes for a "real data size" so other components of the system that use SSTables, and SSTables themselves, have no idea that the data file is compressed.

          Streaming of the data file sends decompressed chunks for convenience of maintaining the transfer, and the receiving party compresses all data before writing to the backing file (see CompressedDataFile.transfer(...) and the CompressedFileReceiver class).

          Tests are showing a dramatic performance increase when reading 1 million rows created with 1024-byte random values. Current code takes >> 1000 secs to read, but with the current patch only 175 secs. Using a 64kb buffer, a 1.7GB file could be compressed into 110MB (data added using ./bin/stress -n 1000000 -S 1024 -V, where the -V option generates average size values and different cardinality from 50 (default) to 250).

          Writes perform a bit better, around 5-10%.
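
          To make the framing concrete, here is a minimal sketch of the chunk layout described above (an 8-byte header for the real, uncompressed data size, then a 4-byte compressed length followed by the compressed bytes for each 64kb chunk). The class and the compressor hook are illustrative, not the code from the attached patch:

          import java.io.DataOutputStream;
          import java.io.IOException;
          import java.io.OutputStream;

          // Illustrative writer for the chunked layout; not the actual CompressedDataFile implementation.
          final class ChunkedCompressedWriter
          {
              static final int CHUNK_SIZE = 64 * 1024;

              private final DataOutputStream out;

              ChunkedCompressedWriter(OutputStream out, long uncompressedDataSize) throws IOException
              {
                  this.out = new DataOutputStream(out);
                  this.out.writeLong(uncompressedDataSize); // 8-byte "real data size" header
              }

              interface Compressor
              {
                  byte[] compress(byte[] input) throws IOException;
              }

              void writeChunk(byte[] uncompressedChunk, Compressor compressor) throws IOException
              {
                  assert uncompressedChunk.length <= CHUNK_SIZE;
                  byte[] compressed = compressor.compress(uncompressedChunk);
                  out.writeInt(compressed.length); // 4-byte compressed chunk size
                  out.write(compressed);
              }
          }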

          Pavel Yaskevich added a comment -

          rebased to latest trunk (commit 42ebfb15951b7795c9dbb4e7f8f73640e582d25b)

          Stu Hood added a comment -

          where -r option generates random values

          The -r flag generates random keys: unless you modified stress.java, the values will be the same for every row.

          Pavel Yaskevich added a comment - edited

          The -r flag generates random keys: unless you modified stress.java, the values will be the same for every row.

          Oh, sorry! I meant -V, not -r. I also used various cardinality values (50-250) in the tests.

          Jonathan Ellis added a comment -

          How does the [row] index interact with the block implementation?

          Pavel Yaskevich added a comment - edited

          It just refers to uncompressed locations; I didn't see a need to change that.

          Benjamin Coverston added a comment -

          I was worried a bit about this ticket while working on 1608 because it had the potential to be a hairy merge. Looking through the patch, however, the implementation looks clean, unobtrusive and well encapsulated. This makes me really excited. Contrary to my initial fears, I'm anticipating that this will actually help with 1608 and probably even help increase IO throughput in the data directory in a dramatic way.

          +1 for this approach.

          Chris Burroughs added a comment -

          Using 64kb buffer 1.7GB file could be compressed into 110MB (data added using ./bin/stress -n 1000000 -S 1024 -V, where -V option generates average size values and different cardinality from 50 (default) to 250).

          This seems like an unrealistically good compression ratio. If I gzip a real world SSTable that has redundant data that should be ripe for compression I only see 641M-->217M. What's the gzip compression ratio with the SSTables that stress.java workload generates?

          Stu, could you post your custom YCSB workload from CASSANDRA-674 for comparison?

          Pavel Yaskevich added a comment -

          This seems like an unrealistically good compression ratio. If I gzip a real world SSTable that has redundant data that should be ripe for compression I only see 641M-->217M. What's the gzip compression ratio with the SSTables that stress.java workload generates?

          You can easily test it yourself: for example, run ./bin/stress -S 1024 -n 1000000 -C 250 -V, wait for compactions to finish and check the block size of the resulting files (using ls -lahs); I see 3.8GB compressed into 781MB in my tests. internal_op_rate with the current trunk code is around 450-500, but with the current patch it is about 2800-3000, on a Quad-Core AMD Opteron(tm) Processor 2374 HE, 4229730MHz on each core, 2GB mem (Rackspace instance). A cardinality of 250 is 5 times bigger than the default, plus average size values using the -V option.

          Andrey Stepachev added a comment -

          Are there any chances to get this working in 0.8.x? (simply applying the patch fails)

          Jonathan Ellis added a comment -

          My guess is that a backport would be difficult because of the changes already in 1.0 (CASSANDRA-2062, CASSANDRA-1610, etc) that deal with the same parts of the code.

          Compression will only be officially supported in 1.0.

          Andrey Stepachev added a comment -

          It is a pity that it's only in 1.0. This is a very desirable feature (especially from the point of view of a former HBase user).

          Jonathan Ellis added a comment -

          I agree, but 1.0 will be out in October. This ticket is over two years old, it can wait another couple months.

          Andrey Stepachev added a comment -

          I agree, but I came from HBase, where this feature was implemented a long time ago. So it was very surprising that Cassandra doesn't have such a feature. But with such a timeframe (like October) it is really not a very big concern, so I'll wait.
          In any case, thanks for your work.

          Stu Hood added a comment -

          I haven't had any luck seeing actual compression with this patch... is there a manual step to enable it? On OSX, the patch slowed the server down to a crawl, but did not result in compression. Performance seems to be reasonable on Linux, but without any effect: running bin/stress -S 1024 -n 1000000 -C 250 -V resulted in 3.3 GB of data.

          Stu Hood added a comment -

          example ./bin/stress -S 1024 -n 1000000 -C 250 -V [...] I see 3.8GB compressed into 781MB in my tests.

          Why would the uncompressed size for -C 250 be different from the uncompressed size for -C 50: 3.8 GB vs the 1.7 GB from before?

          On a side note: a cardinality of 250 makes for less variance in the average size of the random generated values, but I still see relatively large size differences between consecutive runs. It might be worth opening a separate ticket for stress.java to make the generated random values a fixed size.


          With the understanding that there is a lot of variance in the results, here are some preliminary results for the bin/stress -S 1024 -n 1000000 -C 250 -V workload:

          build          disk volume (bytes)   write runtime (s)   read runtime (s)   read ops/s
          trunk          4,015,004,187         372                 >> 1000            ~216
          #674 + #2319   594,796,624           273                 255                3845
          #47*

          * need to figure out the problem I was having above

          Pavel Yaskevich added a comment -

          I haven't had any luck seeing actual compression with this patch... is there a manual step to enable it? On OSX, the patch slowed the server down to a crawl, but did not result in compression. Performance seems to be reasonable on Linux, but without any effect: running bin/stress -S 1024 -n 1000000 -C 250 -V resulted in 3.3 GB of data.

          As I wrote previously - right now on linux you can see the compressed data size by running ls with -ahs and taking the size in blocks, because the current implementation uses file holes (it reserves free space for future chunk changes). I'm currently working on a way to eliminate that need.
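
          For readers unfamiliar with file holes: seeking past the current end of a RandomAccessFile and then writing leaves an unwritten gap that sparse-file-capable filesystems (e.g. ext3/ext4, xfs) do not allocate blocks for, which is why 'ls -l' (apparent size) and 'ls -s' (allocated blocks) disagree. A generic demonstration, not code from the patch:

          import java.io.IOException;
          import java.io.RandomAccessFile;

          public class SparseFileDemo
          {
              public static void main(String[] args) throws IOException
              {
                  RandomAccessFile raf = new RandomAccessFile("sparse-demo.bin", "rw");
                  try
                  {
                      raf.write("chunk header".getBytes("UTF-8"));
                      // Reserve space for a future 64kb chunk without writing it: this leaves a hole.
                      raf.seek(raf.getFilePointer() + 64 * 1024);
                      raf.write("next chunk header".getBytes("UTF-8"));
                  }
                  finally
                  {
                      raf.close();
                  }
                  // 'ls -l' shows the full apparent size; on sparse-capable filesystems
                  // 'ls -s' shows far fewer allocated blocks for the unwritten region.
              }
          }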

          Jonathan Ellis added a comment -

          current implementations uses file holes (reserves free space for the future chunk changes)

          Is this for where we have to seek back to write the row size on large rows?

          We're already making two passes when compacting large rows (first to compute the indexes), we could make the first pass compute the serialized size so we don't have to seek after the 2nd pass that writes the data.

          Pavel Yaskevich added a comment -

          Why would the uncompressed size for -C 250 be different from the uncompressed size for -C 50: 3.8 GB vs the 1.7 GB from before?

          I did simply choose the largest files from individual compactions.

          Also, I don't know why you are getting only 3.7GB of data after `-S 1024 -n 1000000 -C 250 -V`, because in all of my tests I get about 5.1GB with current trunk code and with the patch applied.

          My results:

          build   disk volume (bytes)                              write runtime (s)   read runtime (s)   read ops/s
          trunk   5,241,718,144                                    166                 2210               ~450
          #47     5,090,003,162 (1.2GB of blocks, aka real size)   156                 480                ~2100

          Both sizes are after the last major compaction; Cassandra with default configuration running on a Quad-Core AMD Opteron(tm) Processor 2374 HE with 4229730MHz on each core, 2GB RAM - Debian 5.0 (Lenny) (kernel 2.6.35.4-rscloud) hosted on Rackspace.

          Pavel Yaskevich added a comment -

          Is this for where we have to seek back to write the row size on large rows?

          Exactly.

          We're already making two passes when compacting large rows (first to compute the indexes), we could make the first pass compute the serialized size so we don't have to seek after the 2nd pass that writes the data.

          I will look into it.

          Jonathan Ellis added a comment -

          My preference would be to do seekless 2-pass as a separate patch before this one, btw.

          Terje Marthinussen added a comment -

          Tried to import some data with json2sstable to compare sizes, but it seems like json2sstable is not entirely working yet?

          Pavel Yaskevich added a comment -

          Tests are passing. Does it crash on you? As I wrote before - currently, to check the real size of the file (tested only on linux, because the OS X FS saves empty blocks to the disk for some reason) you need to get a block count using 'ls -alhs'; the current patch reserves empty space for each chunk because we need to do seeks while we write data using SSTableWriter.

          I'm currently working to avoid that (CASSANDRA-2879 is a step in that direction).

          Terje Marthinussen added a comment -

          It starts making the file, but then it crashes.
          java.lang.IndexOutOfBoundsException
          at java.io.RandomAccessFile.readBytes(Native Method)
          at java.io.RandomAccessFile.read(RandomAccessFile.java:338)
          at org.apache.cassandra.io.compress.AbstractCompressedFile.readCompressedChunk(AbstractCompressedFile.java:169)
          at org.apache.cassandra.io.compress.AbstractCompressedFile.reBuffer(AbstractCompressedFile.java:85)
          at org.apache.cassandra.io.compress.CompressedDataFile$Output.writeAtMost(CompressedDataFile.java:420)
          at org.apache.cassandra.io.compress.CompressedDataFile$Output.write(CompressedDataFile.java:404)
          at org.apache.cassandra.db.ColumnIndexer.writeBloomFilter(ColumnIndexer.java:144)
          at org.apache.cassandra.db.ColumnIndexer.serializeInternal(ColumnIndexer.java:114)
          at org.apache.cassandra.db.ColumnIndexer.serialize(ColumnIndexer.java:50)
          at org.apache.cassandra.db.ColumnFamilySerializer.serializeWithIndexes(ColumnFamilySerializer.java:107)
          at org.apache.cassandra.io.sstable.SSTableWriter.append(SSTableWriter.java:156)
          at org.apache.cassandra.tools.SSTableImport.importUnsorted(SSTableImport.java:290)
          at org.apache.cassandra.tools.SSTableImport.importJson(SSTableImport.java:252)
          at org.apache.cassandra.tools.SSTableImport.main(SSTableImport.java:476)
          ERROR: null

          I added a bit of debug printing to the read/writeChunkLength:
          logger.info("TNM: readChunkLength " + len + " in " + this.path + " at: " + super.getFilePointer());

          INFO 08:13:40,652 TNM: writeChunkLength 28511 in /var/folders/8d/8drNFwr6FEe7uEdizf5pAE+++TI/Tmp/Test-h-22-Data.db2879008826574155982compressing at: 1245194
          INFO 08:13:40,655 TNM: writeChunkLength 65538 in /var/folders/8d/8drNFwr6FEe7uEdizf5pAE+++TI/Tmp/Test-h-22-Data.db2879008826574155982compressing at: 1310730
          INFO 08:13:40,656 TNM: readChunkLength 1313507164 in /var/folders/8d/8drNFwr6FEe7uEdizf5pAE+++TI/-Tmp/Test-h-22-Data.db2879008826574155982compressing at: 1376270
          Importing 787 keys...
          Currently imported 1 keys.
          [Some lines deleted]
          INFO 15:10:18,219 TNM: writeChunkLength 14870 in /Users/terje/work/cassandra-0.8/data/JP/Test-h-22-Data.db at: 1114122
          INFO 15:10:18,219 TNM: readChunkLength 21106 in /Users/terje/work/cassandra-0.8/data/JP/Test-h-22-Data.db at: 1179662
          INFO 15:10:18,229 TNM: writeChunkLength 21104 in /Users/terje/work/cassandra-0.8/data/JP/Test-h-22-Data.db at: 1179658
          INFO 15:10:18,258 TNM: readChunkLength 26901 in /Users/terje/work/cassandra-0.8/data/JP/Test-h-22-Data.db at: 1245198
          INFO 15:10:18,260 TNM: writeChunkLength 26901 in /Users/terje/work/cassandra-0.8/data/JP/Test-h-22-Data.db at: 1245194
          INFO 15:10:18,260 TNM: readChunkLength 21104 in /Users/terje/work/cassandra-0.8/data/JP/Test-h-22-Data.db at: 1179662
          INFO 15:10:18,261 TNM: writeChunkLength 21106 in /Users/terje/work/cassandra-0.8/data/JP/Test-h-22-Data.db at: 1179658
          INFO 15:10:18,261 TNM: readChunkLength 26901 in /Users/terje/work/cassandra-0.8/data/JP/Test-h-22-Data.db at: 1245198
          INFO 15:20:18,348 TNM: writeChunkLength 26901 in /Users/terje/work/cassandra-0.8/data/JP/Test-h-22-Data.db at: 1245194
          INFO 15:20:18,370 TNM: readChunkLength 65545 in /Users/terje/work/cassandra-0.8/data/JP/Test-h-22-Data.db at: 1310734
          And then the above exception

          Note, this is on OSX and there seems to be another report of problems with OSX above?

          Pavel Yaskevich added a comment - edited

          Thanks for your report! This will be fixed in the next patch.

          Sylvain Lebresne added a comment -

          As I wrote before - currently to check real size of the file (tested only on linux because OS X FS saves empty blocks to the disk for some reason) you need to get a block count using 'ls -alhs', current patch reserves an empty space for each chunk because we need to do seeks while we write data using SSTableWriter.

          Yeah, I really think we shouldn't do that (i.e., have empty space between the compressed chunks). I'm happy to learn that linux (or at least whatever file system you are using, I haven't tried the patch on linux yet) is smart enough to avoid allocating empty blocks, but we shouldn't rely on this. I bet not all file systems do that (OS X seems to prove that, and I'm not sure all linux FSes do this), and anyway if you transfer the sstables or tar them or anything, it'll still be more inefficient than necessary (because the file is still the size of the uncompressed data). We're also losing some space even on linux depending on what the actual FS block size is (not a big deal, but this can add up). So I think we really need to change the index (and key cache) to store the offset in compressed data. Imho, the simplest way would be, instead of having in the index the key followed by the offset, to have for compressed files the key, then the position of the chunk in the compressed file, then the offset in the uncompressed chunk.

          Another thing is that we will need that to be optional (if only because we cannot expect people to trust this from day one). Don't get me wrong, it's nice to have a first prototype to have an idea of what we're talking about, but I just wanted to mention this because it's probably easier to take that into account sooner than later (I also suspect we may be able to factor out some of the code of BRAF and CDF, but I haven't look too closely so maybe not).
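
          A rough sketch of the read path that index layout would imply, reusing the 4-byte chunk-length framing from the patch description above (illustrative names; the real code would live in the index/key-cache and CompressedDataFile classes):

          import java.io.IOException;
          import java.io.RandomAccessFile;

          // Illustrative lookup for an index entry of the form
          // (key, position of chunk in compressed file, offset within uncompressed chunk).
          final class CompressedLookup
          {
              interface Decompressor
              {
                  byte[] decompress(byte[] compressed) throws IOException;
              }

              /** Returns the uncompressed chunk; the row starts at its index-entry offset within it. */
              static byte[] readChunkForRow(RandomAccessFile dataFile,
                                            long chunkPositionInCompressedFile,
                                            Decompressor decompressor) throws IOException
              {
                  dataFile.seek(chunkPositionInCompressedFile);
                  int compressedLength = dataFile.readInt(); // 4-byte compressed chunk size
                  byte[] compressed = new byte[compressedLength];
                  dataFile.readFully(compressed);
                  return decompressor.decompress(compressed);
              }
          }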

          Pavel Yaskevich added a comment -

          Totally agree with you, Sylvain, I'm working in that direction.

          Terje Marthinussen added a comment -

Sparse files are indeed a filesystem feature.

They are standard in extfs, xfs and probably most of the common filesystems on Linux (not considering FAT).
They are not standard in NTFS, but applications can enable sparse files there on a file-by-file basis. My bet, without checking, is that this is not really supported by Java...

For OS X, sparse files are not supported on HFS, but are supposedly supported on UFS (I have not tested on OS X, so have only read this).

Anyway, I agree (from experience): stay away from using sparseness in files if possible. It's a portability headache and it tends to cause major issues when it fails.

          Pavel Yaskevich added a comment -

v2 eliminates the need for sparse files: an index section is added at the end of the file to hold the chunk sizes, so the header grows to 18 bytes - 2 control bytes, 8 bytes for the uncompressed size and 8 bytes for the offset of the index section. That approach does not use any additional space except for the 4 bytes required to store the index section length (at the head of that section); the chunk sizes are important information, so I don't count the space needed to store them as overhead.
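
Roughly, the framing described above might be written like this (a sketch under the stated 18-byte header; the control-byte values and helper names are assumptions, not the patch's code):

    import java.io.IOException;
    import java.io.RandomAccessFile;

    // Sketch of the v2 framing: an 18-byte header (2 control bytes, uncompressed
    // length, offset of the index section), then the compressed chunks, then an
    // index section whose own length is written at its head.
    final class V2LayoutSketch
    {
        static void writeHeader(RandomAccessFile file, long uncompressedLength) throws IOException
        {
            file.write('C');                 // control bytes (illustrative values)
            file.write('D');
            file.writeLong(uncompressedLength);
            file.writeLong(0L);              // placeholder for the index-section offset, patched on close
        }

        static void writeIndexSection(RandomAccessFile file, int[] chunkSizes) throws IOException
        {
            long indexOffset = file.getFilePointer();
            file.writeInt(chunkSizes.length * 4);   // 4 bytes holding the index section length
            for (int size : chunkSizes)
                file.writeInt(size);
            file.seek(10);                          // back-patch the header: 2 control bytes + 8-byte length
            file.writeLong(indexOffset);
        }
    }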

          Also tried to import/export large files using sstable2json and json2sstable to make sure that it works.

          Pavel Yaskevich added a comment -

          Forgot to mention that this is rebased with latest trunk (latest commit 4629648899e637e8e03938935f126689cce5ad48)

          Jonathan Ellis added a comment -

          index section added at the end of the file to hold chunk sizes

We used to do this with the index entries, but keeping that in memory until you're done can cause a lot of memory pressure. I like the suggestion of moving the index entry to (key, compressed-chunk-offset, uncompressed-offset-within-chunk) better.

          Pavel Yaskevich added a comment -

I did that for the flexibility of using CompressedDataFile without relying on the index file; that index is read only once when the CompressedSegmentedFile is completed and then just gets passed to the constructor in CSF.getSegment(), so even if the file is very big, say 5-7 GB, keeping that index in memory only adds about 1 megabyte of overhead. The index also lets us skip reading an additional 4 bytes of chunk length from the file every time we re-buffer.
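
For a sense of scale (a back-of-the-envelope sketch; the ~64 KiB chunk size is an assumption based on the chunk lengths in the log output earlier in this thread):

    public class ChunkIndexOverhead
    {
        public static void main(String[] args)
        {
            long fileSize = 7L << 30;            // 7 GiB data file (illustrative)
            long chunkSize = 64L << 10;          // assumed ~64 KiB chunks
            long chunks = fileSize / chunkSize;  // ~114,688 chunks
            long indexBytes = chunks * 8;        // one 8-byte entry per chunk: ~0.9 MB in memory
            System.out.println(chunks + " chunks -> " + indexBytes + " bytes of index");
        }
    }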

          Sylvain Lebresne added a comment -

Ok, I think I like the idea of keeping the index of chunk sizes, mostly because it avoids having to change the index (and generally makes it so that less of the code has to be aware that we use compression underneath) and also because it is more compact. A small detail though is that I would store the chunk offsets instead of the chunk sizes, the reason being that it's more resilient to corruption (typically, with chunk sizes, if the first entry is corrupted you're screwed, with offsets, you only have one or two chunks that are unreadable). And in memory, I would probably just store those offsets as a long[]; this would be much more compact (my guesstimate is that it will be on the order of 6x smaller than the list of pairs with compressed pointers), and computing the chunk size is just a trivial (and fast) computation.
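
A minimal sketch of that size-from-offsets derivation (assuming offsets are stored in file order and the compressed file length is known):

    final class ChunkOffsets
    {
        // Chunk i spans [offsets[i], offsets[i+1]) in the compressed file; the last
        // chunk ends at the compressed file length. Sizes never need to be stored,
        // and a corrupted offset only makes one or two chunks unreadable.
        static long chunkSize(long[] offsets, long compressedFileLength, int i)
        {
            long end = (i + 1 < offsets.length) ? offsets[i + 1] : compressedFileLength;
            return end - offsets[i];
        }
    }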

I would prefer putting this index and the header into a separate component (a -Compression component?). Admittedly this is in good part personal preference (I like that the -Data file contains only data) and symmetry with what we have so far, but it would also avoid having to wait until we close the file to write that metadata, which is nice.

Talking about the header, the control bytes detection is not correct, since we haven't used control bytes so far: there is no guarantee an existing data file won't start with the bytes 'C' then 'D' (having or not having a -Compression component could serve this purpose though).
In that component, there are also at least a few other pieces of information I would want to add:

          • a version number at the beginning (it's always useful)
          • the compression algorithm
          • the chunk size
Even if a first version of the patch doesn't allow configuring those, it's likely we'll change that soon, and it's just a string and an int to add, so better to plan ahead.

As I said in a previous comment, we also really need to have compression as an option. And actually I think this patch has quite a bit of code duplication with BRAF. After all, CompressedDataFile is just a BRAF with a fixed buffer size and a mechanism to translate uncompressed file positions to compressed file positions (roughly). So I'm pretty sure it should be possible to have CompressedDataFile extend BRAF with minimal refactoring (of BRAF, that is). It would also lift, for free, the limitation of not having read-write compressed files (not that we use them, but ...).

          Jonathan Ellis added a comment -

I would prefer putting this index and the header into a separate component (a -Compression component?) ... it would also avoid having to wait until we close the file to write that metadata

          +1

          a version number at the beginning

          Are we really going to version compression separately from the "main" sstable version? Seems like it will be rare enough that we only need the latter.

          the compression algorithm, the chunk size

          +1, that's good futureproofing.

we also really need to have compression as an option

          +1

          Pavel Yaskevich added a comment - - edited

          A small detail though is that I would store the chunk offsets instead of the chunk sizes, the reason being that it's more resilient to corruption (typically, with chunk sizes, if the first entry is corrupted you're screwed, with offsets, you only have one or two chunks that are unreadable).

          +1 if we will go with a separate file. I'm thinking if we will go with a separate file I will use the same strategy as I did in v1 - store chunk size at the beginning of the chunk and re-read it instead of keeping it in memory (lowers memory usage for larger files).

After all, CompressedDataFile is just a BRAF with a fixed buffer size and a mechanism to translate uncompressed file positions to compressed file positions (roughly). So I'm pretty sure it should be possible to have CompressedDataFile extend BRAF with minimal refactoring (of BRAF, that is). It would also lift, for free, the limitation of not having read-write compressed files (not that we use them, but ...).

To extend BRAF we will need to split it into Input/Output classes, which will imply refactoring of the skip cache functionality and other parts of that class. I'd rather create a separate issue to do that after compression is committed instead of putting all our eggs in one basket.

          +1 on everything else.

          Sylvain Lebresne added a comment -

          Are we really going to version compression separately from the "main" sstable version?

          Yeah, I guess using the "main" sstable version is ok.

          I'm thinking if we will go with a separate file I will use the same strategy as I did in v1 - store chunk size at the beginning of the chunk and re-read it

          Just so that we agree, if you have the offsets to the beginning of the chunk, you don't really need the chunk sizes. That being said, it could be useful to have the size in front of the chunk to be able to uncompress a data file even if the 'chunk index' got screwed up (but it will be redundant info).

To extend BRAF we will need to split it into Input/Output classes, which will imply refactoring of the skip cache functionality and other parts of that class

Why? Is there something in compressed files that requires an Input and an Output class? Can't we just have the CDF seek() method throw an exception if the CDF has been opened in "rw" mode? And if we don't split it (which, again, I don't see why we would have to, but maybe I'm missing something), I'm pretty sure very few parts will require refactoring (skip cache isn't one of them, CDF would just set skipCache to false; and I don't see why skip cache would be a problem with compression anyway).
In any case, having compression optional is a requirement and, in my book, the more important one. To be clear, I'm -1 on committing anything where compression is not optional (we cannot ask people to trust compression on day 1, and I strongly think that "let's commit and fix it after" is the wrong way to go). So we at least need CDF and BRAF to have some common ancestor for that.

          Pavel Yaskevich added a comment -

Why? Is there something in compressed files that requires an Input and an Output class? Can't we just have the CDF seek() method throw an exception if the CDF has been opened in "rw" mode? And if we don't split it (which, again, I don't see why we would have to, but maybe I'm missing something), I'm pretty sure very few parts will require refactoring (skip cache isn't one of them, CDF would just set skipCache to false; and I don't see why skip cache would be a problem with compression anyway).

The thing about Input/Output classes was mentioned previously in CASSANDRA-1470. I'm -1 on having "the seek() method throw an exception if the CDF has been opened in 'rw' mode" because this is not a clean interface; I would rather make separate classes, as that will be a more reasonable and clean design. Anyway, even right now the common ancestor of both is RandomAccessFile (or even FileDataInput). So I'm -1 on merging CDF and BRAF before we have BRAF refactored.

In any case, having compression optional is a requirement and, in my book, the more important one. To be clear, I'm -1 on committing anything where compression is not optional (we cannot ask people to trust compression on day 1, and I strongly think that "let's commit and fix it after" is the wrong way to go). So we at least need CDF and BRAF to have some common ancestor for that.

To be clear, I'm not proposing "let's commit and fix it after"; compression can easily be made optional with the current state of the patch, and I'm making it my top priority.

I would prefer putting this index and the header into a separate component (a -Compression component?).

Thinking about that further - I'm a bit concerned about adding one more file to handle a single SSTable; the main goal of my design here was to make CDF independent from other components of the system to avoid any additional complexity. So maybe it's better to stream the file offsets to a temporary file while the SSTable is being written, and after that store the index section at the end of the file (as an alternative to keeping that index in memory)?

Talking about the header, the control bytes detection is not correct, since we haven't used control bytes so far: there is no guarantee an existing data file won't start with the bytes 'C' then 'D' (having or not having a -Compression component could serve this purpose though).

          We can use a magic number the same way as gzip does http://en.wikipedia.org/wiki/Gzip#File_format.
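
Purely for illustration, such a check might look like this (the magic value here is hypothetical, and as the next comment notes, this is no more reliable than the existing control bytes):

    import java.io.DataInput;
    import java.io.IOException;

    final class MagicCheck
    {
        // Hypothetical two-byte magic, analogous to gzip's 0x1f 0x8b.
        static final int MAGIC = 0xCA55;

        static boolean looksCompressed(DataInput in) throws IOException
        {
            return in.readUnsignedShort() == MAGIC;
        }
    }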

          Sylvain Lebresne added a comment -

The thing about Input/Output classes was mentioned previously in CASSANDRA-1470. I'm -1 on having "the seek() method throw an exception if the CDF has been opened in 'rw' mode" because this is not a clean interface; I would rather make separate classes, as that will be a more reasonable and clean design. Anyway, even right now the common ancestor of both is RandomAccessFile (or even FileDataInput). So I'm -1 on merging CDF and BRAF before we have BRAF refactored.

I don't understand that argument. BRAF and CDF do the same thing; they only differ in that CDF has a decompression/compression step while moving data in/out of the buffer and a slight translation of which part of the file to buffer. The rest of the code is exactly the same: all the buffer manipulation, when to sync, when to rebuffer, etc. And it's not the simplest code ever, so not a place where code duplication sounds like a good idea.

I'm a bit concerned about adding one more file to handle a single SSTable; the main goal of my design here was to make CDF independent from other components of the system to avoid any additional complexity

          I don't see why adding a new component adds any complexity. I actually find it rather cleaner, as that component would likely nicely correspond to an in-memory object holding all the metadata related to compression.

maybe it's better to stream the file offsets to a temporary file while the SSTable is being written, and after that store the index section at the end of the file

If what you mean is what I understand, that sounds way more complicated than having a separate component.

          We can use a magic number the same way as gzip does http://en.wikipedia.org/wiki/Gzip#File_format.

          That wouldn't be more reliable than the control bytes.

          Pavel Yaskevich added a comment -

I don't understand that argument. BRAF and CDF do the same thing; they only differ in that CDF has a decompression/compression step while moving data in/out of the buffer and a slight translation of which part of the file to buffer. The rest of the code is exactly the same: all the buffer manipulation, when to sync, when to rebuffer, etc. And it's not the simplest code ever, so not a place where code duplication sounds like a good idea.

The argument is: I am willing to do a "merge" but would like to do that in a separate task, because that would involve refactoring of BRAF (as I wrote previously); I don't like to use BRAF in the current state where Input/Output operations aren't split.

          I don't see why adding a new component adds any complexity.

Because having a new component for the metadata means that you won't be able to open and read a compressed file without first opening and reading the metadata, which is no better than holding the information about chunks in the INDEX component.

          Jonathan Ellis added a comment -

          I don't like to use BRAF in the current state where Input/Output operations aren't split

          I agree that it feels fragile the way it is now.

A lot of the complexity is from supporting seeks on writes – with CASSANDRA-2879 finished we don't need that feature anymore. Creating a random-access reader and an append-only writer would simplify things a lot IMO.

          Pavel Yaskevich added a comment -

Option use_sstable_compression (boolean) is added to conf/cassandra.yaml.

CompressedDataFile is removed and split into CompressedRandomAccessReader (extends RandomAccessReader) and CompressedSequentialWriter (extends SequentialWriter).

No attempt is made to compress old SSTables.

A new component, CompressionInfo, is added to the SSTable; it holds information about the chunk length, uncompressed data length and chunk offsets.
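
As a rough sketch of what such a component could contain (field order and names are illustrative assumptions, not the patch's exact on-disk format):

    import java.io.DataOutputStream;
    import java.io.FileOutputStream;
    import java.io.IOException;

    // Writes a hypothetical -CompressionInfo.db: the chunk length, the total
    // uncompressed data length, then one offset per compressed chunk.
    final class CompressionInfoSketch
    {
        static void write(String path, int chunkLength, long dataLength, long[] chunkOffsets) throws IOException
        {
            DataOutputStream out = new DataOutputStream(new FileOutputStream(path));
            try
            {
                out.writeInt(chunkLength);
                out.writeLong(dataLength);
                out.writeInt(chunkOffsets.length);
                for (long offset : chunkOffsets)
                    out.writeLong(offset);
            }
            finally
            {
                out.close();
            }
        }
    }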

          Pavel Yaskevich added a comment -

          rebased with latest trunk (last commit 3ac0d33e31d1829af7cbd0e3ff127357975716f0)

          Sylvain Lebresne added a comment -

          I think we're on the right track. A bunch of comments though:

• I would put the compression option at the CF level. I think it will be particularly useful at first, when people want to test compression before trusting it, to be able to switch a few less important CFs to compression first. In the longer term, compression probably has a little cost for some workloads (especially considering we don't yet have mmapped compressed files), so it'd also be useful for that. And there is the fact that it's not really more effort on our side. Nit: I would also call the option more concisely 'compress_sstables'.
• I would move the Metadata class out of CRAR (and call it CompressionMetadata). There are two reasons:
            • It is useful outside of CRAR: an example already is CompressedSegmentedFile. The other will be when/if we add a MMappedCompressedSegmentFile.
            • I would also move all the code related to the metadata there. For instance, we can have a CompressionMetadata.Writer with the code related to writing those that CSW would use. That would improve encapsulation I think.
• About the Metadata, we should really use a long[] instead of List<Long> for the offsets, the memory size difference will be significant. Also, since this is a fairly performance sensitive path, I would create a small Chunk class with a long and an int field to replace the Pair<Long, Integer> returned by chunckFor, to avoid the boxing/unboxing (see the sketch just after this list).
          • Let's add the 'compression algorithm' in the compressionInfo component. It's fine to hard set it to "Snappy" for writes and ignore the value on read for now.
          • In CFS.scrubDataDirectory, it seems that if there is a compressionInfo segment but no data file, we do not purge orphaned files. I believe this is wrong.
• CRAR.transfer() should honor FileStreamTask.CHUNK_SIZE (as a max at least) for its buffer. In particular, having the possibility to allow a 2GB buffer as now (and it's not even unrealistic) is not a good idea.
          • CompressedSequentialWriter does not honor skipIOCache during flush. I actually think that flush() is a bit problematic for CSW in that it should only be called when the buffer is full (or it is the end). It's obviously true of reBuffer and sync too then they use flush. But flush() and sync() are public. Maybe we could rename flush and sync (in say flushInternal and syncInternal) and make the public method throw an unsupported exception. Anyway, I think it would be cleaner if the flush of CSW was calling the one of SW (this would solve the skipIOCache problem in particular). We can move the buffer write into a protected method in SW and only override that in CSW, and let the flush of SW untouched otherwise.
          • Not totally related to that patch but it seems that in SW, when you call reBuffer, there is one call to resetBuffer in flush() and one after it. It seems the one in reBuffer is useless (that would make reBuffer be an alias to flush() really).
• In CSW, why not reuse the buffer where compressed data is used?
          • I think the truncate methods are not correct for CSW. The offset given is a position in the uncompressed data, so it will not truncate where it should. For truncateAndClose(), we should probably not have it public anyway: the only place it is used, it's really close() that you want to use, and we can do "the right thing" in close for CSW. For truncate, it should probably throw an unsupported exception.
            For ResetAndTruncate, I think we have a harder problem. What the method should do is: seek back to the chunk containing the wanted destination (imply keeping chunk offsets in memory), load the chunk in the buffer and position the bufferOffset correctly.

          And to end, a few nits:

          • In SegmentedFile, I would prefer adding a 'compressed' flag to getBuilder() instead of having getCompressedBuilder. If only because we should probably add a mmapped compressed mode at some point.
• There's a useless added import in CompactionManager, and a wrongly placed import in SSTW.
• In CRAR, I would use metadata.chunkLength instead of buffer.length in most places (decompressChunk mainly). I know it is the same value right now, but it's really chunkLength that we want here, so it feels clearer (nothing prevents us from using a buffer larger than chunkLength, not that it would be useful)
          • In SSTR and SSTW, we can use the isCompressed SSTable flag instead of 'if (components.contains(Component.COMPRESSION_INFO))'.
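
A minimal sketch of the long[]-plus-Chunk idea from the list above (assuming fixed-size uncompressed chunks; class and field names follow the suggestion and are not necessarily the final code):

    final class CompressionMetadataSketch
    {
        private final long[] chunkOffsets;       // one compressed-file offset per uncompressed chunk
        private final long compressedFileLength;
        private final int chunkLength;           // uncompressed bytes per chunk

        CompressionMetadataSketch(long[] chunkOffsets, long compressedFileLength, int chunkLength)
        {
            this.chunkOffsets = chunkOffsets;
            this.compressedFileLength = compressedFileLength;
            this.chunkLength = chunkLength;
        }

        // Plain long/int fields instead of Pair<Long, Integer>, so lookups avoid boxing.
        static final class Chunk
        {
            final long offset;
            final int length;

            Chunk(long offset, int length)
            {
                this.offset = offset;
                this.length = length;
            }
        }

        Chunk chunkFor(long uncompressedPosition)
        {
            int i = (int) (uncompressedPosition / chunkLength);
            long end = (i + 1 < chunkOffsets.length) ? chunkOffsets[i + 1] : compressedFileLength;
            return new Chunk(chunkOffsets[i], (int) (end - chunkOffsets[i]));
        }
    }
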
          Pavel Yaskevich added a comment - - edited

We need to decide whether we do this per CF or at the global level.

I don't think that mmapping the compressed file is a good idea, because we won't be able to avoid buffer copies the way we do with uncompressed data (see MappedFileDataInput). I agree with the other arguments not related to mmap mode.

          Let's add the 'compression algorithm' in the compressionInfo component. It's fine to hard set it to "Snappy" for writes and ignore the value on read for now.

We will need that field to be fixed-size, or size + value, because just writing a string in the header could potentially be dangerous.

          In SSTR and SSTW, we can use the isCompressed SSTable flag instead of 'if (components.contains(Component.COMPRESSION_INFO))'.

I will remove one use of it in SSTW, but in SSTR it is used in a static method where we don't have the isCompressed flag.

          T Jake Luciani added a comment -

          This should be a per-cf option IMO with a default of "compressed"

          Pavel Yaskevich added a comment -

          I concur, we can drop a global option in that case.

          Sylvain Lebresne added a comment -

          This should be a per-cf option IMO with a default of "compressed"

I'm -1 on the default being compressed. This would mean that upon upgrade, people would get their files compressed unless they do something about it. This is new code and we're talking about the data files. I will be happy to set the default to compressed in a year, but not now.

          Pavel Yaskevich added a comment -

I agree with Sylvain: we should do that as a CF option "compression" (or a better name), set to 'false' by default.

          Pavel Yaskevich added a comment -
• I would put the compression option at the CF level. I think it will be particularly useful at first, when people want to test compression before trusting it, to be able to switch a few less important CFs to compression first. In the longer term, compression probably has a little cost for some workloads (especially considering we don't yet have mmapped compressed files), so it'd also be useful for that. And there is the fact that it's not really more effort on our side. Nit: I would also call the option more concisely 'compress_sstables'.

Added a 'compression' option to the CF + made the CLI recognize it (the CLI help is updated also). Removed that option from cassandra.yaml and the related Config and DatabaseDescriptor classes.

• I would move the Metadata class out of CRAR (and call it CompressionMetadata). There are two reasons:

          Extracted Metadata class from CRAR, named it CompressionMetadata and added Writer to it + changed CSW to support that.

          • About the Metadata, we should really use a long[] instead of List<Long> for the offsets, the memory size difference will be significant. Also, since this is a fairly performance sensitive path, I would create a small Chunk class with a long and an int field to replace the Pair<Long, Integer> returned by chunckFor, to avoid the boxing/unboxing.

          Done both - changed List<Long> to long[] and Pair<Long, Integer> to Chunk.

          • Let's add the 'compression algorithm' in the compressionInfo component. It's fine to hard set it to "Snappy" for writes and ignore the value on read for now.

The algorithm name is added to the header of CompressionInfo.db in "size + data" format; it is accessible via the "algorithm" variable of CompressionMetadata.

          • In CFS.scrubDataDirectory, it seems that if there is a compressionInfo segment but no data file, we do not purge orphaned files. I believe this is wrong.

          Fixed.

• CRAR.transfer() should honor FileStreamTask.CHUNK_SIZE (as a max at least) for its buffer. In particular, having the possibility to allow a 2GB buffer as now (and it's not even unrealistic) is not a good idea.

          Fixed.

          • CompressedSequentialWriter does not honor skipIOCache during flush. I actually think that flush() is a bit problematic for CSW in that it should only be called when the buffer is full (or it is the end). It's obviously true of reBuffer and sync too then they use flush. But flush() and sync() are public. Maybe we could rename flush and sync (in say flushInternal and syncInternal) and make the public method throw an unsupported exception. Anyway, I think it would be cleaner if the flush of CSW was calling the one of SW (this would solve the skipIOCache problem in particular). We can move the buffer write into a protected method in SW and only override that in CSW, and let the flush of SW untouched otherwise.

Made a protected flushData method which is now overridden in CSW instead of flush(); I didn't change the visibility of the sync and flush methods because that requires further consideration and should be done in a separate task.

          • Not totally related to that patch but it seems that in SW, when you call reBuffer, there is one call to resetBuffer in flush() and one after it. It seems the one in reBuffer is useless (that would make reBuffer be an alias to flush() really).

While we have a public flush method, the resetBuffer() call in there should stay.

• In CSW, why not reuse the buffer where compressed data is used?

          Fixed.

          • I think the truncate methods are not correct for CSW. The offset given is a position in the uncompressed data, so it will not truncate where it should. For truncateAndClose(), we should probably not have it public anyway: the only place it is used, it's really close() that you want to use, and we can do "the right thing" in close for CSW. For truncate, it should probably throw an unsupported exception.
            For ResetAndTruncate, I think we have a harder problem. What the method should do is: seek back to the chunk containing the wanted destination (imply keeping chunk offsets in memory), load the chunk in the buffer and position the bufferOffset correctly.

Made truncateAndClose() protected and changed SSTW to use close() (with a comment); the mark() method is overridden in CSW and creates a FileMark with the chunk offset instead of the uncompressed offset.

          • In SegmentedFile, I would prefer adding a 'compressed' flag to getBuilder() instead of having getCompressedBuilder. If only because we should probably add a mmapped compressed mode at some point.

That would be changed when and if we decide to use mmapped I/O with compressed files.

• There's a useless added import in CompactionManager, and a wrongly placed import in SSTW.

          Fixed.

• In CRAR, I would use metadata.chunkLength instead of buffer.length in most places (decompressChunk mainly). I know it is the same value right now, but it's really chunkLength that we want here, so it feels clearer (nothing prevents us from using a buffer larger than chunkLength, not that it would be useful)

buffer.length always equals chunkLength; we can't have a buffer of a different size because of decompression requirements.

          • In SSTR and SSTW, we can use the isCompressed SSTable flag instead of 'if (components.contains(Component.COMPRESSION_INFO))'.

Removed it from SSTW, but in SSTR it is used in a static method where we don't have the isCompressed flag.

          Pavel Yaskevich added a comment -

You will need to do "ant realclean" because internode.avro and cassandra.thrift were modified, and please don't forget to commit the interface/thrift/* modifications.

          Brandon Williams added a comment -

ISTM that having a generic 'compression' option is not the best way to go, since in the future something better may come along (imagine if we had started with LZO and then snappy was released).

          Pavel Yaskevich added a comment -

          That option will be sufficient if we leave the choice of which algorithm/lib to use up to us and keep information about the algorithm in the header to support older files.

          Jonathan Ellis added a comment -

          I don't see a reason to have two settings (on/off and type) when one (type/null) will do.

          Terje Marthinussen added a comment -

          Instead of on/off we could use a size threshold.

          In the Cassandra we run, we have compression implemented at the supercolumn level.

          It turned out to be very good for performance for us not to compress data in memtables (which we would normally do with compression on supercolumns) or during flushing from memtables, as both of these caused a slowdown in the write path.

          Under heavy write activity, the resulting sstables from memtable flushes often get pretty small (maybe 20MB on average in our case), so compression does not really make much difference to disk consumption there, but the performance penalty does.

          All the compression/decompression on compacting the smallest tables also makes a noticeable difference when trying to keep up on the compaction side.

          Instead we went for compression which only happens when a source sstable in a compaction is larger than 4GB.

          I would recommend considering similar functionality here.

          I started off with ning for our compression, but I now run the built-in Java deflate to get even better compression. Since we only compress the largest sstables, and do no other compression in the write path or on compaction of small sstables, the very slow compression of deflate does not bother us that much.

          The read side is of course still slower with inflate, but it is still more than fast enough not to be a problem.

          OS caching will also be better thanks to the better compression, so we can regain some of the performance lost vs. ning/snappy there.

          We could also consider being very tunable, with deflate for very large sstables, ning/snappy for smaller ones and no compression for the smallest, but I am not sure it is worth it.

          By the way, how much difference did you see between ning and snappy? When I tested, there was not all that much difference, and I felt ning was easier to bundle, so to me it seemed like the better alternative.
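
          For illustration, the size-threshold approach could be expressed as a small policy check at compaction time. The sketch below is hypothetical (the class, enum and threshold names are invented and are not part of the attached patches); it only captures the shape of the decision described above: skip compression for small flush results, use a fast codec for mid-sized sstables, and reserve deflate for the very large ones.

          ```java
          // Illustrative only: decide how to compress a compaction's output
          // based on the estimated size of its input sstables.
          public final class CompactionCompressionPolicy
          {
              public enum Codec { NONE, SNAPPY, DEFLATE }

              // 4 GB threshold as mentioned in the comment above; presumably configurable in practice.
              private static final long COMPRESS_THRESHOLD_BYTES = 4L * 1024 * 1024 * 1024;

              public Codec codecFor(long estimatedInputBytes)
              {
                  if (estimatedInputBytes < COMPRESS_THRESHOLD_BYTES)
                      return Codec.NONE;     // small flush/compaction output: compression cost not worth it
                  if (estimatedInputBytes < 4 * COMPRESS_THRESHOLD_BYTES)
                      return Codec.SNAPPY;   // medium: fast codec, modest ratio
                  return Codec.DEFLATE;      // very large: spend CPU for a better ratio
              }
          }
          ```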

          Sylvain Lebresne added a comment -

          On the compression option, we can also have a compression flag and later add a compression_options (as we do for the compaction_strategy). There is a good chance we will end up with more than one option (there is at least the algorithm and chunk size that make sense, and maybe Terje's idea, though that is clearly out of scope for this ticket; some algorithms may also have tunables between compression ratio and speed).
          Of course, we could also do all of this with a single String (that we would parse server side), or with a mix of both solutions. It seems slightly simpler to just have a flag to turn compression on/off with a good default and have options aside for those who care. Just saying, I don't really care so much.

          didn't change visibility of the sync and flush methods because this requires further consideration and should be done in the separate task

          We need to do something for this ticket. Right now, if someone calls the public flush or sync methods of CSW mistakenly, it will corrupt data. So we should at least have the public flush and sync throw an UnsupportedOperationException. I'm not saying it will be particularly clean, but I'll take "slightly ugly and safe" over "cleaner but dangerous" anytime. I'm fine with leaving a "cleaner" refactoring of this to a separate task though.
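
          A minimal sketch of that "safe" option, assuming a SequentialWriter-like base class with a protected flushData() hook (the class names here are simplified stand-ins, not the actual Cassandra classes):

          ```java
          // Simplified stand-ins: the point is only that the compressed writer rejects the
          // public flush()/sync() entry points, since flushing a partially filled buffer
          // would emit a short chunk and desynchronize the chunk index.
          class BufferedWriter
          {
              public void flush() { flushData(); /* plus buffer bookkeeping */ }
              public void sync()  { flush();     /* plus forcing to disk     */ }
              protected void flushData() { /* write the buffer out as-is */ }
          }

          class CompressedWriter extends BufferedWriter
          {
              @Override
              public void flush()
              {
                  throw new UnsupportedOperationException("explicit flushing is not supported for compressed writers");
              }

              @Override
              public void sync()
              {
                  throw new UnsupportedOperationException();
              }

              @Override
              protected void flushData()
              {
                  // compress the full buffer, append it as one chunk and record its offset
              }
          }
          ```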

          Other comments:

          • resetAndTruncate is still a problem. What is saved in the mark now is the chunkOffset, that is, the offset to the start of the current chunk, but this ignores everything that could be in the buffer at the time the mark is set. So when you resetAndTruncate, you will not go back to the position you were at when calling mark, but to the beginning of the chunk.
            The mark should save both the chunkOffset and the bufferOffset. Then resetAndTruncate must seek back to chunkOffset, uncompress the chunk into the buffer and set bufferOffset to what was saved.
            Nit: it could be worth optimizing (in SW and CSW) for the case where the mark is in the current buffer. It's not unlikely to be the case given where we use that.
          • In CSW, when resetBuffer is called, current is supposed to either be "aligned" on chunk boundary or we're closing the file. So it seems there is no need to realign bufferOffset, and thus no need to override resetBuffer.
          • The truncateAndClose of CSW doesn't seem to truncate anything. It also doesn't honor skipIOCache correctly since it doesn't call the truncateAndClose of SW. But actually, I think that if the only backward seek we do is through resetAndTruncate, then there is no need to truncate on close (neither for SW nor CSW). So we should probably get rid of that function and move the relevant parts in close().
          • Let's use readUTF/writeUTF to read/write the algorithm name in the metadata file. That's what we use for strings usually (and using a StringBuilder to read a string is a tad over the top).
          • CompressionMetadata.readChunkOffsets() is buggy if the dataLength is a multiple of the chunkLength (we have one less chunk than what's computed then).
            Nit: a simple for(i = 0; i < nbChunks; ++i) loop would be cleaner (you can still capture the EOF and rethrow, however); a loop of that shape is sketched just after this list.
          • Maybe we should have a few unit tests for all this? If we could make it so that the tests we already have for compaction and such are run with and without compression, that would already be good.
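
          For the readChunkOffsets() point, a loop of the suggested shape might look like this (a standalone sketch with invented names, assuming the writer records an explicit chunk count):

          ```java
          import java.io.DataInputStream;
          import java.io.EOFException;
          import java.io.IOException;

          final class ChunkOffsetsReader
          {
              // Reads the chunk offsets of one compressed data file. Assumes the writer
              // recorded an explicit chunk count, so no count has to be derived from
              // dataLength/chunkLength (which is where the off-by-one above comes from).
              static long[] readChunkOffsets(DataInputStream in) throws IOException
              {
                  int chunkCount = in.readInt();
                  long[] offsets = new long[chunkCount];   // long[] rather than List<Long> to keep the footprint small
                  for (int i = 0; i < chunkCount; i++)
                  {
                      try
                      {
                          offsets[i] = in.readLong();
                      }
                      catch (EOFException e)
                      {
                          throw new IOException("Corrupted compression metadata: expected " + chunkCount
                                                + " chunk offsets but hit EOF after " + i, e);
                      }
                  }
                  return offsets;
              }
          }
          ```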

          Other Nits:

          • No need to reset validBufferBytes in CSW.flushData(). It's done in resetBuffer (and not in SW.flushData(), so that'll improve symmetry).
          • Chunk could be a static class in CompressionMetadata, I suppose.
          Jonathan Ellis added a comment -

          we can also have a compression flag and later add a compression_options (as we do for the compaction_strategy)

          True – I'd rather make it even more like compaction/replication strategies, and have a compression_class as the "primary" setting (null for none) and compression_options for tuning.
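
          For what it's worth, a compression_class setting would presumably point at something pluggable along these lines. This is a purely hypothetical sketch; the interface and method names below are invented for illustration and are not the API added by this ticket.

          ```java
          // Hypothetical shape of a pluggable compressor behind a compression_class setting.
          public interface SSTableCompressor
          {
              /** Compresses input[0..inputLength) into output and returns the compressed length. */
              int compress(byte[] input, int inputLength, byte[] output) throws java.io.IOException;

              /** Decompresses the whole input chunk into output and returns the uncompressed length. */
              int uncompress(byte[] input, int inputLength, byte[] output) throws java.io.IOException;
          }
          ```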

          Pavel Yaskevich added a comment -

          We need to do something for this ticket. Right now, if someone calls the public flush or sync methods of CSW mistakenly, it will corrupt data. So we should at least have the public flush and sync throw an UnsupportedOperationException. I'm not saying it will be particularly clean, but I'll take "slightly ugly and safe" over "cleaner but dangerous" anytime. I'm fine with leaving a "cleaner" refactoring of this to a separate task though.

          CSW sync() and flush() both throw UnsupportedOperationException now.

          resetAndTruncate is still a problem. ...

          Fixed by using information from metadata file (to avoid keeping information about chunks in memory).

          In CSW, when resetBuffer is called, current is supposed to either be "aligned" on chunk boundary or we're closing the file. So it seems there is no need to realign bufferOffset, and thus no need to override resetBuffer.

          Fixed.

          The truncateAndClose of CSW doesn't seem to truncate anything. It also doesn't honor skipIOCache correctly since it doesn't call the truncateAndClose of SW. But actually, I think that if the only backward seek we do is through resetAndTruncate, then there is no need to truncate on close (neither for SW nor CSW). So we should probably get rid of that function and move the relevant parts in close().

          truncateAndClose method is removed, relevant parts moved to close().

          Let's use readUTF/writeUTF to read/write the algorithm name in the metadata file. That's what we use for strings usually (and using a StringBuilder to read a string is a tad over the top).

          Done.

          CompressionMetadata.readChunkOffsets() is buggy if the dataLength is a multiple of the chunkLength (we have one less chunk than what's computed then).

          Fixed by storing the chunk count in the index file, so we no longer need to count anything.

          No need to reset validBufferBytes in CSW.flushData(). It's done in resetBuffer (and not in SW.flushData(), so that'll improve symmetry).

          Fixed.

          Chunk could be a static class in CompressionMetadata, I suppose.

          Done.

          Sylvain Lebresne added a comment -

          I agree with Jonathan on the "let's make the compression option even more like compaction strategies". However, it may be simpler to commit this with the boolean flag and change it in another ticket, when we make the compression algorithm "pluggable". I'm willing to take responsibility for making sure that second ticket happens within a few days, so that no-one really "sees" the intermediate state where compression is a boolean flag.

          Attaching a v5 that squashes v4 and v4-fixes and makes the following changes/additions:

          • Bump thrift version
          • Fix a few places in CFMetaData where compression wasn't handled
          • Fix resetAndTruncate(), both in CSW and SW (in SW, the "reset in current buffer" optimization was broken, and for CSW, bufferOffset and chunkCount weren't reset correctly, truncation wasn't done at all, and the chunk index wasn't reset or truncated). Also adds a unit test for resetAndTruncate for both SW and CSW, both with and without the "reset in current buffer" optimization.
          • Slightly clean up CompressionMetadata: we now only keep the dataLengthOffset and write both this and the chunk count at the same time at the end.
          • Fix a problem with the handling of non-compressed sstables in a CF where compression is activated. The previous code expected the sstables to be compressed as soon as the compression flag was set on the CF. Instead, the compression option is only used during writes to determine whether the resulting sstable is compressed. When opening an sstable, however, the presence of the CompressionInfo component is what is used to detect whether it is compressed or not (see the sketch after this list).
          • Add a new build target 'ant test-compression' that runs the unit tests with compression turned on for all CF. This is arguably a bit lame, and we should probably do better, but in the meantime I think it's quick and useful.
          • Add support for compression in stress (-I flag (supposed to mean "Inflate" since c, C and i were all taken))
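
          As a side note on the "presence of CompressionInfo" rule from the list above, the read-side decision boils down to something like this (simplified stand-in types, not the real SSTableReader code): the CF-level compression flag only affects how new sstables are written, while reads key off the component that exists on disk.

          ```java
          import java.util.Set;

          final class SSTableOpenSketch
          {
              enum Component { DATA, PRIMARY_INDEX, FILTER, COMPRESSION_INFO }

              // An sstable is treated as compressed iff its CompressionInfo component
              // exists on disk, regardless of the current CF-level compression setting.
              static boolean isCompressed(Set<Component> components)
              {
                  return components.contains(Component.COMPRESSION_INFO);
              }

              static String readerFor(Set<Component> components)
              {
                  // Compressed data files need the chunk-aware reader; plain files keep
                  // using the standard (possibly mmapped) reader.
                  return isCompressed(components) ? "CompressedRandomAccessReader" : "RandomAccessReader";
              }
          }
          ```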

          The changes above are fairly simple or cosmetic (except maybe for resetAndTruncate, but there is the unit test now), and I think the patch is in a pretty good state, so this has my +1 (great work Pavel!). I'll still leave a bit of time for someone else to check those last changes before committing.

          Pavel Yaskevich added a comment -

          +1

          Sylvain Lebresne added a comment -

          Committed, thanks.

          Hudson added a comment -

          Integrated in Cassandra #995 (See https://builds.apache.org/job/Cassandra/995/)
          Add optional compression for sstables
          patch by xedin; reviewed by slebresne for CASSANDRA-47

          slebresne : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1153210
          Files :

          • /cassandra/trunk/src/java/org/apache/cassandra/io/sstable/SSTable.java
          • /cassandra/trunk/src/java/org/apache/cassandra/io/sstable/SSTableScanner.java
          • /cassandra/trunk/interface/cassandra.thrift
          • /cassandra/trunk/interface/thrift/gen-java/org/apache/cassandra/thrift/CfDef.java
          • /cassandra/trunk/src/java/org/apache/cassandra/io/compress/CompressionMetadata.java
          • /cassandra/trunk/src/java/org/apache/cassandra/db/compaction/CompactionManager.java
          • /cassandra/trunk/src/java/org/apache/cassandra/io/sstable/SSTableWriter.java
          • /cassandra/trunk/src/java/org/apache/cassandra/io/util/SequentialWriter.java
          • /cassandra/trunk/src/java/org/apache/cassandra/io/sstable/Component.java
          • /cassandra/trunk/src/java/org/apache/cassandra/io/util/SegmentedFile.java
          • /cassandra/trunk/src/java/org/apache/cassandra/io/compress/CompressedRandomAccessReader.java
          • /cassandra/trunk/src/java/org/apache/cassandra/streaming/FileStreamTask.java
          • /cassandra/trunk/test/unit/org/apache/cassandra/io/sstable/SSTableTest.java
          • /cassandra/trunk/test/unit/org/apache/cassandra/SchemaLoader.java
          • /cassandra/trunk/src/java/org/apache/cassandra/io/compress
          • /cassandra/trunk/src/java/org/apache/cassandra/cli/CliClient.java
          • /cassandra/trunk/interface/thrift/gen-java/org/apache/cassandra/thrift/Cassandra.java
          • /cassandra/trunk/test/unit/org/apache/cassandra/db/TableTest.java
          • /cassandra/trunk/build.xml
          • /cassandra/trunk/lib/licenses/snappy-java-1.0.3.txt
          • /cassandra/trunk/NOTICE.txt
          • /cassandra/trunk/src/java/org/apache/cassandra/io/compress/CompressedSequentialWriter.java
          • /cassandra/trunk/test/unit/org/apache/cassandra/io/compress
          • /cassandra/trunk/src/java/org/apache/cassandra/streaming/IncomingStreamReader.java
          • /cassandra/trunk/test/unit/org/apache/cassandra/io/compress/CompressedRandomAccessReaderTest.java
          • /cassandra/trunk/src/avro/internode.genavro
          • /cassandra/trunk/src/java/org/apache/cassandra/io/util/CompressedSegmentedFile.java
          • /cassandra/trunk/src/java/org/apache/cassandra/io/util/RandomAccessReader.java
          • /cassandra/trunk/lib/snappy-java-1.0.3.jar
          • /cassandra/trunk/tools/stress/src/org/apache/cassandra/stress/Session.java
          • /cassandra/trunk/src/resources/org/apache/cassandra/cli/CliHelp.yaml
          • /cassandra/trunk/interface/thrift/gen-java/org/apache/cassandra/thrift/Constants.java
          • /cassandra/trunk/src/java/org/apache/cassandra/io/sstable/SSTableReader.java
          • /cassandra/trunk/src/java/org/apache/cassandra/config/CFMetaData.java

            People

            • Assignee: Pavel Yaskevich
            • Reporter: Jonathan Ellis
            • Reviewer: Sylvain Lebresne
            • Votes: 13
            • Watchers: 32
