Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Won't Fix
    • Fix Version/s: None
    • Component/s: Core
    • Labels:
      None

      Description

      Various tickets exist due to limitations in the SSTable file format, including #16, #47 and #328. Attached is a proposed design/implementation of a new file format for SSTables that addresses a few of these limitations.

      This v2 implementation is not ready for serious use: see comments for remaining issues. It is roughly the format described here: http://wiki.apache.org/cassandra/FileFormatDesignDoc

      Attachments

      1. trunk-ycsb.log
        137 kB
        Stu Hood
      2. 674-ycsb.log
        128 kB
        Stu Hood
      3. 674-v3.tgz
        66 kB
        Stu Hood
      4. 674-v2.tgz
        44 kB
        Stu Hood
      5. 674-v1.diff
        252 kB
        Stu Hood

        Issue Links

          Activity

          Stu Hood added a comment -

          > Is replicate-on-write=disabled why uncompressed went from the highest latency to the lowest?

          Yes: replicate-on-write triggers a huge number of reads, which are much more expensive in trunk, due to 2319 not being included.

          Chris Burroughs added a comment -

          Is replicate-on-write=disabled why uncompressed went from the highest latency to the lowest?

          Stu Hood added a comment -

          I reran the test mentioned in comment-13054228 with replicate-on-write disabled, which makes for a much fairer comparison (trunk/47 require 2 seeks to miss for a column, and 3 to hit). This version of trunk also includes CASSANDRA-47 snappy compression.

          build                  disk volume (bytes)  bytes per column  runtime (s)  throughput (ops/s)  avg read ms  99th % read ms
          trunk - uncompressed   16,713,328,798       66.8              6154         40620               2.54         6
          trunk - gz 6 *         2,747,319,000        10.98             -            -                   -            -
          trunk - snappy         4,356,461,652        17.4              7906         31618               4.64         15
          674+2319               2,675,888,207        10.7              7703         32454               3.04         10

          * trunk - gz 6 is the size of compressing the data directory of the trunk result at GZIP level 6

          In this workload, we're reading from the tail of the row, which means that CASSANDRA-47 needs to decode two blocks per read (one for the row index at the head of the row, and one for the columns at the tail).

          Stu Hood added a comment -

          I've posted the slightly-divergent branch of YCSB I used for this workload at https://github.com/stuhood/YCSB/tree/monotonic-timeseries

          Stu Hood added a comment -

          To clarify, I included the "trunk gz 6" result since it is essentially a lower bound for block-based compression. On the other hand, there is some low hanging fruit that could decrease the size of the 674+2319 result by another 1 to 1.5 bytes per column.

          Stu Hood added a comment - edited

          I've finished rebasing 2319 onto 674 (to gain back the wide-row random access performance we lost here by removing the row index). Nothing has changed in 674 (still v3), but 2319 was simplified considerably by not having to deal with Lazily vs Pre CompactedRows.

          I'm also attaching some performance numbers for a wide-row timeseries use case. The workload was:

          • 10000 rows
          • 250 million columns (randomly across the 10000 rows)
          • 99% writes (appends), 1% reads (from tail of row)
          • LongType column names, CounterColumnType column values
          • Custom YCSB workload using a monotonically increasing long for column name AND value: on average, it will increase by N for every new column in a row for N rows
          • (8G Linux cgroup) - (6G JVM heap) ~= (2G of Linux page cache)

          Result summary:

          build          disk volume (bytes)  bytes per column  runtime (s)  throughput (ops/s)  50th % read ms  99th % read ms
          trunk          16,716,432,189       66.8              8620         29001               9.161           154
          trunk gz 6 *   2,747,319,000        10.98             -            -                   -               -
          674+2319       2,375,027,696        9.5               7939         31488               2.444           20

          * "trunk gz 6" is the size of compressing the data directory of the trunk result at GZIP level 6

          I would love to work with someone in the community to review this branch: I feel comfortable that the remaining issues could be worked out after commit, but I'm willing to do anything it takes before merge. Internally, we're working to deploy this branch within the next few weeks.

          Stu Hood added a comment - edited

          Attaching a new version: v3. I've extracted most of the tasks that can be accomplished independently into other tickets. For testing purposes, it may be easiest to review the commits on Github.

          Changes from v2:

          • Removed Avro
          • Added block checksumming
          • Switched to type-specific compression via CASSANDRA-2398
          • Used type-specific compression for timestamps
          • Implemented supercolumn support

          This revision compresses wide rows very well, and the datafile format is essentially finalized. Next steps are to improve the performance at read time:

          1. Incorporate CASSANDRA-2319 to improve wide row access times
            • Since the patch removes the column index, reads always begin at the beginning of the row, and scan until the correct column range is found. 2319 would allow for random access to a block
          2. Store more than one row per block in order to take the best advantage of compression for narrow rows
            • This patch adds a Cursor object, which represents the position in a block and file. SSTableScanner will need to hold a Cursor between rows, and pass it into each IColumnIterator that is created
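           As a minimal, hypothetical sketch of the Cursor idea in the last point (class, field and method names here are illustrative assumptions, not code from the patch):

           // Hypothetical sketch: a cursor that remembers where the previous row's
           // iterator stopped, so the next IColumnIterator can resume in the same
           // decompressed block instead of seeking and decoding it again.
           public final class Cursor
           {
               private long blockOffset = -1;   // file offset of the block currently decoded
               private byte[] decodedBlock;     // decompressed contents of that block
               private int positionInBlock;     // next unread byte within decodedBlock

               /** True if this cursor already holds the block starting at the given file offset. */
               public boolean isOpenAt(long offset)
               {
                   return decodedBlock != null && blockOffset == offset;
               }

               /** Record a freshly decoded block so later iterators can reuse it. */
               public void open(long offset, byte[] contents, int position)
               {
                   this.blockOffset = offset;
                   this.decodedBlock = contents;
                   this.positionInBlock = position;
               }

               public byte[] block() { return decodedBlock; }
               public int position() { return positionInBlock; }
               public void advance(int bytes) { positionInBlock += bytes; }
           }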
          Stu Hood added a comment -

          Linking to issues that I extracted from this patchset.

          Stu Hood added a comment -

          One of the key blockers is implementing rebuilding of SSTables post-streaming. Based on an IRC conversation yesterday, the smoothest way to support streaming of older SSTable versions was to turn what is now the SSTableWriter.Builder object into an abstract base class and subclass it: I'll probably try to do this in a separate ticket.

          Stu Hood added a comment - edited

          Attaching a snapshot of the work on this new design for the file format. There are a few important blockers not yet implemented, as evidenced by failing tests, but I feel confident that the serialization and the format themselves are solid.

          Please carefully note the caveats listed under v2 on the design doc: http://wiki.apache.org/cassandra/FileFormatDesignDoc#v2 In particular, we will barely see the benefits of compression with this implementation, because Spans are storing at most 1 row. Storing more than one row in a span will require changes to the sstable index (awesome changes), and I'd feel much better doing it once this is in trunk.

          EDIT: Also, this necessarily applies atop CASSANDRA-2062.

          Stu Hood added a comment -

          Posting a new description; trusting JIRA to preserve the original for posterity.

          Jonathan Ellis added a comment -

          Here is an interesting paper on a way to get both good inter-record and intra-record data locality: http://scholar.google.com/scholar?q=A+Storage+Model+to+Bridge+the+Processor/Memory+Speed+Gap.

          Not sure how to apply that to an arbitrarily-large-rows model like ours tho.

          Stu Hood added a comment -

          > If we assume we keep the datamodel as is how can we simplify the open ended-ness of your design to make the approach fit our current data model.
          To keep this from becoming a point of contention, I'll remove that goal from the design doc: the design so far has this feature as a side effect though.

          Stu Hood added a comment - edited

          >> Indexes for individual rows are gone, since the global index allows random access...
          > ^ This wouldn't be useful to cache? in the situation you only want a small range of columns?
          That information is outdated: it's from the original implementation. But yes... we will want to keep the index in app memory or page cache.

          > Roughly how large would the actual chunk be? This is the unit of deserialization right?
          The span is the unit of deserialization (made up of at most 1 chunk per level), and its size would be 100% configurable. The main question is how frequently to index the spans in the sstable index: does each span get an index entry? or only the first span of a row (this is our approach in the current implementation).

          EDIT: Sorry... the span is symbolic: you would deserialize the first chunk of the span (containing the keys) to decide whether to skip the rest of the chunks in the span.

          > So if you are doing a range query on a very wide row how do you know when to stop processing chunks?
          By looking at the global index: if all spans get entries in the index, you know the last interesting span.

          > Let me know if this is wrong, but this design opens the cassandra data model to contain arbitrarily nested data.
          > Given the complexity we already have surrounding the supercolumn concept do you think this is the right way forward?
          The super column concept is only confusing because we call them "supercolumns" rather than just calling them "compound column names". People use them, and the consensus I've heard is that they are useful.

          > If we assume we keep the datamodel as is how can we simplify the open ended-ness of your design to make the approach fit our current data model.
          The only difference is what you call the structures, and whether you put arbitrary limits on the nesting: I'm open to suggestions.

          T Jake Luciani added a comment -

          Let me know if this is wrong, but this design opens the cassandra data model to contain arbitrarily nested data.

          Given the complexity we already have surrounding the supercolumn concept do you think this is the right way forward?
          As much as my inner geek wants to build a tree or graph model I don't think the C* community or committers want to take it this way.

          If we assume we keep the datamodel as is how can we simplify the open ended-ness of your design to make the approach fit our current data model.

          T Jake Luciani added a comment -

          > the metadata is useless on its own. It only becomes useful when it is attached to data (a column or to a range), so there is no reason to cache the meta- independently of the data.

          But above you mention:

          > Indexes for individual rows are gone, since the global index allows random access to the middle of column families that span Blocks, and Slices allow batches of columns to be skipped within a Block.
          

          ^ This wouldn't be useful to cache? in the situation you only want a small range of columns?

          ----- More questions ----
          Roughly how large would the actual chunk be? This is the unit of deserialization right? or can avro deserialize only part of a structure?

          So if you are doing a range query on a very wide row how do you know when to stop processing chunks? do you keep going till you hit the sentinel value <empty> ?

          Stu Hood added a comment -

          > How will ranges be stored? The parent ordering would mean the sorting of data at that level is lost no?
          Added some explanation of how I think ranges should work to the wiki. http://wiki.apache.org/cassandra/FileFormatDesignDoc?action=diff&rev1=15&rev2=16

          > Are chunks broken up by size only?
          Technically "spans" are the largest unit, so they define the boundaries: tried to clarify this part as well. There are a few possible thresholds, including a max number of rows, columns, range tombstones or total bytes in the span.

          One semi-undefined portion is what happens when a row is larger than can be stuffed in a span. Most likely we'll want to use the range metadata to indicate the portion of the row covered by the span (the approach I took in the original implementation attached here).

          > Will the metadata be ripe for caching?
          I don't think so: the metadata is useless on its own. It only becomes useful when it is attached to data (a column or to a range), so there is no reason to cache the meta- independently of the data.

          Thanks!

          T Jake Luciani added a comment -

          As I try to wrap my head around this I'm listing questions that come to mind:

          • How will ranges be stored? The parent ordering would mean the sorting of data at that level is lost no?
          • Are chunks broken up by size only?
          • Will the metadata be ripe for caching?
          Stu Hood added a comment -

          Linking CASSANDRA-494, since they require range tombstones.

          Stu Hood added a comment -

          Thinking about this issue again. Dumped some thoughts I had on paper to the wiki: http://wiki.apache.org/cassandra/FileFormatDesignDoc .

          Ryan King added a comment -

          YES!

          Stu Hood added a comment -

          After having played with Avro a bit more, I'm all for using its DataFile format in the SSTable. The variable-length integer encoding, built-in compression, schema migration and block recovery schemes are a win.

          Jonathan Ellis added a comment -

          Let's push data format change to 0.8. I'm burned out on it, and nobody else is stepping up to help review.

          We can fix CASSANDRA-16 with the current data format for 0.7.

          David Strauss added a comment -

          @jbellis Sorry, it seems that I was confusing the commit logs (where there can only be one receiving writes on each node to avoid seeks) with the SSTable files (where multiple ones may be receiving writes on each node).

          Jonathan Ellis added a comment -

          there's no such thing as "the oldest sstable," and even if there were, there is no way to know which columns need to increase the count without actually doing the full merge as we do currently.

          consider a hypothetical oldest sstable with a row whose count you have set to 10. there is another sstable fragment with column A in that row. is A an update to the original 10, or a new insert? you have no way of knowing.

          "count is slow" is one of the tradeoffs we make for having super fast writes (no update-in-place) and snapshotting. it's the right tradeoff, but there's no magic wand to make it a free lunch.

          David Strauss added a comment -

          This is a good opportunity to improve get_count() performance. Currently, it is O(n) at call-time, where n is the number of columns being counted. I discussed the issue with Stu on IRC, and he mentioned how a "mini-merge" happens at call-time for the SSTables storing data for a column, making it difficult to maintain counts.

          Instead of counting all columns, we could maintain and use column counts in the oldest SSTable and "repair" the relevant counts at get_count() call-time with the changes found in the newer SSTables. That would allow calls to get_count() to run in O(m) time, where m is the number of columns being counted in all but the oldest SSTable. (Granted, m can approach n on high write volume, but m can never exceed n.)

          For stable data, this would bring get_count() to near constant-time with performance gradually degrading depending on the number of non-oldest SSTables.

          (Note: I'm probably missing a multiplier in my big-O notation for looking up columns in older SSTables to detect intersections.)
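           For illustration only, a rough sketch of the "stored count plus repair" arithmetic being proposed; all class and method names below are hypothetical and not real Cassandra code. The lookup against the oldest SSTable's column names is the extra multiplier mentioned in the note above.

           import java.util.HashSet;
           import java.util.List;
           import java.util.Set;

           // Hypothetical arithmetic only: the oldest sstable contributes a precomputed
           // count, and the newer sstables' column names are merged in to adjust it.
           final class CountRepairSketch
           {
               static long repairedCount(long oldestSSTableCount,
                                         Set<String> oldestColumnNames,          // names present in the oldest sstable
                                         List<Set<String>> newerSSTableColumns)  // one name-set per newer sstable
               {
                   Set<String> additions = new HashSet<>();
                   for (Set<String> names : newerSSTableColumns)
                       for (String name : names)
                           if (!oldestColumnNames.contains(name))   // updates to existing columns don't change the count
                               additions.add(name);
                   return oldestSSTableCount + additions.size();
               }
           }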

          Stu Hood added a comment -

          > i could go for a "put all the metadata at the head of the block, rest of the block is just name value timestamp, name value timestamp... "
          Similarly, the RCFile design from Hive stores all keys at the head of a block: http://hadoop.apache.org/hive/docs/r0.4.0/api/org/apache/hadoop/hive/ql/io/RCFile.html . I don't know if we should go so far as supporting arbitrary compression per column family, but making the data easier for a generic compression algo to squish is a nice side effect.

          Jonathan Ellis added a comment -

          i could go for a "put all the metadata at the head of the block, rest of the block is just name value timestamp, name value timestamp... " design. then you'd have a block index file for as-close-to-random-access-as-you're-gonna-get, a duplicate of block headers in a 2nd file for redundancy, and probably a key-oriented BF file.

          Stu Hood added a comment -

          > it would be easier to skip from one group to another w/ the "slice" indexes next to each other
          Since the block might be compressed, you can't assume random access to the whole data file: you might have to scan the block from the beginning anyway. So indexing externally to the data file at a resolution higher than blocks is of questionable value.

          Jonathan Ellis added a comment -

          another argument in favor of using external indexes instead of in-file "slices" would be CASSANDRA-571 – it would be easier to skip from one group to another w/ the "slice" indexes next to each other on disk instead of scattered through the data file.

          Stu Hood added a comment -

          I've extracted the current interfaces for SSTableReader and SSTableScanner, and I'm going to start modifying the interfaces to be closer to the original 674-v1 patch, which should take a week or so. Then, if everyone is happy with the outcome and satisfied that we'll be able to maintain the interface for a few versions, we can get that interface merged and start thinking about the format again.

          Jonathan Ellis added a comment -

          But if we're just going to use the Avro format as a "container" for non-avro data i don't see the point.

          Kevin Weil added a comment -

          I haven't worked enough with Avro to be sure, but my understanding is that the metadata block can be made pretty lightweight. It's more for Avro schema resolution than trying to minimize number of files, as I understand it. It'd be nice if you could even instruct Avro not to put the schema in the metadata for known, generated schemas, though I don't know if that's possible or not. Either way, it doesn't mandate the indices be stored in the metadata. Agreed that things are nicer when different types of data are in their own files.

          Jonathan Ellis added a comment -

          unfortunately but not surprisingly, the avro format is a poor fit for cassandra. it appears to be designed for hdfs, where having multiple files is expensive, so metadata (such as cassandra indexes) is stored in the same file as object data, after the normal blocks it describes.

          this is how cassandra did things back in 0.3, following the bigtable model, and it is lousy for us because you have to save up the index in memory as you write data out; since cassandra sstables are not bounded, you can easily OOM doing this, which is why in 0.4 we moved to a separate index file. (additionally, the code is simpler and cleaner when you split different types of data into its own file.)

          Kevin Weil added a comment -

          I'm with Ryan. Clearly there is a huge caveat because I'm much more of a Hadoop dev than a Cassandra dev. I'm not at all suggesting that Cassandra should bend over backward to fit another system, but if there is a way to nudge things so as to make technologies work together, I think that's to everyone's benefit. Hadoop users will have a more straightforward path to Cassandra adoption and vice versa. Allowing the two technologies to leverage each other's strengths would be a great thing.

          Ryan King added a comment -

          This is why the sync markers from avro would be useful. If you bitrot, you'll only lose the block with the rot in it.

          Jonathan Ellis added a comment -

          Another advantage of having an external index-like structure containing redundant information to block headers: if bitrot corrupts a block header we can still recover.

          Jonathan Ellis added a comment -

          (actually we already are limited to 64K since we are using writeUTF. but i don't think we are enforcing that limit at the thrift level)

          Jonathan Ellis added a comment - edited

          mental note: when we change sstable format, let's take advantage of the opportunity to restrict key lengths to 64K (i.e., 16 bits)

          edit: our use of writeUTF is already silently enforcing this. I added a check to our thrift validation to raise an intelligible error to the user if a longer key is sent.
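           For illustration, the validation-layer check amounts to something of this shape (hypothetical class and messages, not the actual patch; writeUTF uses modified UTF-8, so plain UTF-8 byte length is an approximation that is close enough for a sketch):

           import java.nio.charset.StandardCharsets;

           final class KeyLengthCheck
           {
               // writeUTF stores the encoded length in an unsigned 16-bit field
               private static final int MAX_KEY_LENGTH = 65535;

               // Hypothetical sketch: reject over-long keys with a clear message at the
               // validation layer instead of letting writeUTF fail deeper in the write path.
               static void validateKey(String key)
               {
                   if (key == null || key.isEmpty())
                       throw new IllegalArgumentException("Key may not be empty");
                   int encodedLength = key.getBytes(StandardCharsets.UTF_8).length;
                   if (encodedLength > MAX_KEY_LENGTH)
                       throw new IllegalArgumentException("Key length of " + encodedLength
                                                          + " bytes exceeds the maximum of " + MAX_KEY_LENGTH);
               }
           }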

          Ryan King added a comment -

          stu-

          I understand that the overlap is coincidental, I'm just hoping to encourage cooperation where possible. I certainly have a personal bias here, because I'd like to move our infrastructure to using a common data serialization across our online (cassandra) and offline (hadoop) storage. That's not to say that we couldn't make the integration work, but it seems like some awesome things could happen when everyone is using the same data format.

          Jonathan Ellis added a comment -

          > While I don't like the idea of a separate file being necessary in order to read from the 'data' file of the sstable

          And I don't like the idea of the data file containing lots of weird speed bumps of headers and such that aren't actually data.

          Stu Hood added a comment -

          kingryan:
          >... it seems that there's a degree of overlap with Avro's object container files...
          Purely coincidental, I assure you... There might be some benefits in conforming to their standard (we would get streaming support in Hadoop for free), but we need versioning at the SSTReader/Writer level anyway, so versioning within the file is overkill, and I'm fairly sure that the binary serialization we do here will be noticeably faster than Avro.

          Adding that magic sync marker seems like a good idea though.

          Stu Hood added a comment -

          jbellis:
          > I would rather see the "things with the same parent' concept be an iterator, with metadata from a separate
          > file (like the current key index) used to determine begin/end
          I couldn't sleep due to timezone changes, and this suggestion kept jumping into my head. While I don't like the idea of a separate file being necessary in order to read from the 'data' file of the sstable (currently, it stands alone: the index and filter files are optimizations), I think moving all of the location information out of the SliceMarks is a good idea.

          To imitate the implementation of indexes in trunk, perhaps the Block in 674-v1 becomes the unit that has its own 'index' (so to speak): the first thing you see when you open the block is the list of Slices contained in the block. Naively, this would be a list of SliceMarks with indexes into the block, but because the Slice information is all stored contiguously, you can optimize it considerably (no need for 'nextKey', and all consecutive slices that share any parents only need those parent keys stored once). Then, following the 'index' for the block, the remainder of the block would just be consecutive columns.
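           A hypothetical sketch of that block layout (all names invented for illustration, not taken from 674-v1): a small slice index at the head of the block, followed by the raw column data.

           import java.util.List;

           // Hypothetical layout sketch: a block begins with a compact index of the
           // slices it contains, followed by the column data for all slices back to back.
           final class BlockSketch
           {
               static final class SliceEntry
               {
                   final byte[][] parentKeys;     // row key (and supercolumn name, if any), stored once per run of slices
                   final byte[] firstColumnName;  // lets readers skip to the slice covering a queried column
                   final int offsetInBlock;       // where this slice's columns start, relative to the end of the index

                   SliceEntry(byte[][] parentKeys, byte[] firstColumnName, int offsetInBlock)
                   {
                       this.parentKeys = parentKeys;
                       this.firstColumnName = firstColumnName;
                       this.offsetInBlock = offsetInBlock;
                   }
               }

               final List<SliceEntry> sliceIndex;  // deserialized first when the block is opened
               final byte[] columnData;            // consecutive (name, value, timestamp) entries

               BlockSketch(List<SliceEntry> sliceIndex, byte[] columnData)
               {
                   this.sliceIndex = sliceIndex;
                   this.columnData = columnData;
               }
           }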

          > But that is just a first impression I am throwing out fwiw.
          Agreed... I always seem to tend toward waterfall, and it hurts in the long run.

          Ryan King added a comment -

          I haven't had a chance to look through this patch very closely, so forgive me if this is a dumb suggestion, but it seems that there's a degree of overlap with Avro's object container files: http://hadoop.apache.org/avro/docs/current/spec.html#Object+Container+Files. Have we looked at those at all?

          Jonathan Ellis added a comment -

          ISTM that Slice is trying to solve the problem "how do I avoid repeating the Key/SC name w/ each column entry, now that I have moved to a global index." This is the central difficulty with this approach. So, I definitely agree that we need a concept that means "all the columns w/ the same parent" (sort of like the existing IColumnContainer) but I don't think Slice as it exists here is the right one. I would rather see the "things with the same parent' concept be an iterator, with metadata from a separate file (like the current key index) used to determine begin/end, rather than have an object inside a block that you need to (potentially) assemble multiple of to get the "things with the same parent" concept.

          I also think that if I were doing this myself I would probably make part 1 be a conversion to the global index and just inefficiently repeat the Key/SC data, and then try to make it efficient with the Slice/iterator-thing next. But that is just a first impression I am throwing out fwiw.

          Stu Hood added a comment - edited

          I'm marking this one as blocked by 389, because we had a good head start on adding backward compatible sstable versioning there.

          Once versioning is merged, next steps will be extracting abstract base classes for SSTableReader and SSTableScanner, and extending them with the SSTable format in trunk, and the format in 674-v1.

          Stu Hood added a comment -

          The one major compaction that typically triggers while inserting 1 million items fails immediately with this code: see #4 in the comments. So, if that major compaction succeeded, writes would probably be slower, and reads would be faster.

          Jonathan Ellis added a comment -

          if you're seeing exactly equal times on insert (is that what this says?) you're probably not doing enough compactions

          Stu Hood added a comment -

          Here are stress.py runs of current trunk (default config), and 674-v1 applied to trunk with data file mmap support disabled. It should be possible to make this code competitive with trunk once mmap is added back.

          Jonathan Ellis added a comment -

          What kind of performance do you get on reads and writes with stress.py vs the old code? (without compression, to compare apples to apples)

          Note that stress.py uses very narrow rows so it's pretty much a best-case scenario for this approach; we should test with much wider rows, too.

          Jonathan Ellis added a comment -

          This approach is fine for proof of concept, but be aware that any sstable format change that we actually commit is going to need to support reading the old version. So ultimately what a patch set like this needs to look like is

          00: provide APIs that CFS et al can use to read data from either old or new versions (e.g. getScanner, getFileDataInput); probably you will end up with an AbstractSSTableReader class with common functionality like getColumnComparator
          01: refactor old SSTR class and callers to use the new API
          02: introduce the new data file format in separate classes

          Splitting it up like this is also going to be much much easier to rebase against the moving target of trunk (and there is enough missing here that it looks like it's going to need to be rebased for a while).

          Stu Hood added a comment -

          List of features stubbed as "FIXME: not implemented" in v1:
          1. Reverse slicing within CFs is not implemented (see SSTableSliceIterator),
          2. Reading SuperColumns is disabled (see SSTable(Slice|Names)Iterator),
          3. The recently added MMAP support for data files is disabled until I can port this SSTableScanner interface to use it (see SSTableReader),
          4. AntiEntropyService is not hashing slices (meaning that major compactions always fail).
          5. SSTable(Import|Export) are broken,
          6. BinaryMemtables will crash on flush,
          7. The bytesRead MBean for CompactionManager is disabled,
          8. AntiCompaction is not using the 'skip ranges we don't need' optimization.

          Also, I lied in the description above: the patch does not have GZIP compression enabled, but you can add two lines to enable it: add a GZIPInputStream to the chain in SSTableReader.Block.stream(), and a GZIPOutputStream to the chain in SSTableWriter.BlockContext.flushSlice(). There is a memory leak related to reading from compressed blocks which will quickly kill the server, but it should be easy to track down.
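           For reference, the kind of stream wrapping described above looks like this generic java.util.zip sketch (not the actual patch hunks; the SSTableReader/SSTableWriter method names are only mentioned in the comments for orientation):

           import java.io.ByteArrayInputStream;
           import java.io.ByteArrayOutputStream;
           import java.io.DataInputStream;
           import java.io.DataOutputStream;
           import java.io.IOException;
           import java.util.zip.GZIPInputStream;
           import java.util.zip.GZIPOutputStream;

           final class GzipChainSketch
           {
               // Write side: interpose a GZIPOutputStream before the raw sink, analogous to
               // the change described for SSTableWriter.BlockContext.flushSlice().
               static byte[] compressBlock(byte[] rawBlock) throws IOException
               {
                   ByteArrayOutputStream sink = new ByteArrayOutputStream();
                   try (DataOutputStream out = new DataOutputStream(new GZIPOutputStream(sink)))
                   {
                       out.write(rawBlock);
                   }
                   return sink.toByteArray();
               }

               // Read side: interpose a GZIPInputStream, analogous to the change described
               // for SSTableReader.Block.stream().
               static byte[] decompressBlock(byte[] compressedBlock, int expectedLength) throws IOException
               {
                   try (DataInputStream in = new DataInputStream(
                           new GZIPInputStream(new ByteArrayInputStream(compressedBlock))))
                   {
                       byte[] raw = new byte[expectedLength];
                       in.readFully(raw);
                       return raw;
                   }
               }
           }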

          Finally, there are tons of other TODOs/FIXMEs scattered around, many of which should be tackled in other tickets.


            People

            • Assignee:
              Unassigned
              Reporter:
              Stu Hood
             • Votes:
               16
               Watchers:
               55

              Dates

              • Created:
                Updated:
                Resolved:
