Details

    • Type: Improvement
    • Status: Awaiting Feedback
    • Priority: Minor
    • Resolution: Unresolved
    • Fix Version/s: 4.x
    • Component/s: None
    • Labels: None

      Description

      Looking at a heap dump of a 2.0 cluster, I found that the majority of the objects are IndexInfo and its ByteBuffers. This is especially bad on endpoints with large CQL partitions. If a CQL partition is, say, 6.4GB, it will have 100K IndexInfo objects and 200K ByteBuffers. This creates a lot of churn for GC. Can this be improved by not creating so many objects?
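      A rough back-of-the-envelope sketch of where those counts come from, assuming the default column_index_size_in_kb of 64 (one IndexInfo per 64KB of partition data) and two ByteBuffers (first and last name) per IndexInfo:

      // Rough estimate only; assumes the default column_index_size_in_kb of 64 and
      // two ByteBuffers (first/last clustering name) held by each IndexInfo.
      public final class IndexInfoEstimate
      {
          public static void main(String[] args)
          {
              long partitionBytes = 6_400L * 1024 * 1024;  // a ~6.4GB CQL partition
              long indexIntervalBytes = 64 * 1024;         // one IndexInfo per 64KB of row data

              long indexInfoCount = partitionBytes / indexIntervalBytes; // ~100K IndexInfo objects
              long byteBufferCount = indexInfoCount * 2;                 // ~200K ByteBuffers

              System.out.printf("IndexInfo objects: %,d%n", indexInfoCount);
              System.out.printf("ByteBuffers: %,d%n", byteBufferCount);
          }
      }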

      1. gc_collection_times_with_birch.png
        172 kB
        Michael Kjellman
      2. gc_collection_times_without_birch.png
        260 kB
        Michael Kjellman
      3. gc_counts_with_birch.png
        114 kB
        Michael Kjellman
      4. gc_counts_without_birch.png
        107 kB
        Michael Kjellman
      5. perf_cluster_1_with_birch_read_latency_and_counts.png
        78 kB
        Michael Kjellman
      6. perf_cluster_2_with_birch_read_latency_and_counts.png
        83 kB
        Michael Kjellman
      7. perf_cluster_2_with_birch_write_latency_and_counts.png
        80 kB
        Michael Kjellman
      8. perf_cluster_3_without_birch_write_latency_and_counts.png
        109 kB
        Michael Kjellman
      9. perf_cluster_1_with_birch_write_latency_and_counts.png
        77 kB
        Michael Kjellman
      10. perf_cluster_3_without_birch_read_latency_and_counts.png
        99 kB
        Michael Kjellman
      11. 0f8e28c220fd5af6c7b5dd2d3dab6936c4aa4b6b.patch
        5 kB
        Jeff Jirsa

        Issue Links

          Activity

          mkjellman Michael Kjellman added a comment -

          Just wanted to give a quick update:

          1. I'm really sorry for the delay getting this finished for trunk. I've started a trunk based/post 8099 version 3 times now – the holidays happened – more pressing things stole my attention – big commits like removing Thrift, CFMetadata, etc, etc kept getting committed before I was done, and well – enough excuses from me...
          2. A belated thanks for your initial comments Branimir – I did read them and I'll be addressing them with my trunk rebased changes.
          3. I'm almost done with the refactoring to move all the current array-based index logic into an IndexEntry implementation. I have all unit tests passing (finally) with the exception of KeyCacheCqlTest (which I'm working on right now).
          4. Assuming I get the post-8099 Indexed Iterator based abstractions/changes correct, it should be a matter of just dropping in the Birch package/classes I had for 2.1 and switching the default serializer to use the Birch IndexedEntry implementation.
          blambov Branimir Lambov added a comment -

          As this is not a critical bug-fix, the 2.1 version cannot go into the codebase, and unfortunately I cannot justify investing the time to give it a full review until we have a trunk patch.

          I looked at the main BirchReader/Writer components again as they are likely to stay the same for a trunk patch. Here are my comments:

          • I still think not storing the first key in intermediate nodes would save significant amounts of space and time and should be implemented.
          • There are a pair of methods called binarySearch which return the floor (less than or equal) for a given key. I would prefer them to be named after what they produce, as binarySearch implies a certain kind of result (negative for non-equal) and the fact that it is implemented through binary search is largely an implementation detail (see the sketch after this list).
          • The entries == 1 check looks suspect, as there should be no need for one-entry nodes in the tree. Could you comment on why it is necessary?
          • I think it is better to handle the special meaning of empty key and the reversed flag first thing in the search method rather than propagating it into the binarySearch calls, especially since you know the position of the first (0) and last (descriptor.getFirstNodeOffset() - descriptor.getAlignedPageSize()) leaf nodes in the tree. The iterator initialization already does that.
          • The meaning of "matching" in the search doc is unclear. What happens on no equal element present? If it returns the floor, please state so.
          • BirchIterator could use a forward/reversed implementation split.
          • There's a lot of potential off-by-one mishap in BirchIterator, and not only in the reverse case:
            • the first element returned in the forward case can be less than searchKey (if not equal);
            • the respective problem is also there in the reverse case;
            • the page we find the entry in during initialization (currentPage) is not the page we apply that index to during the first computeNext (currentPage - 1);
            • the column iteration code probably masks the first two at some non-trivial efficiency cost, but the latter looks like something that can get to the surface as missed data.
          • The level generation loop in BirchWriter is effectively a while (true) loop as we always reset inProgressInnerNodeLevel before going into it.
          • The list should never become empty, thus the emptiness check is suspect – if it were ever needed it would indicate an error in the logic; I'd replace it with a non-empty assertion.
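
          A minimal sketch of the floor-returning search contract suggested above; the class and method names (FloorSearch, floorIndexOf) are hypothetical, not the actual BirchReader API:

          import java.nio.ByteBuffer;
          import java.util.List;

          // Hypothetical naming sketch only, not the actual BirchReader/Writer code.
          final class FloorSearch
          {
              /**
               * Returns the index of the greatest key less than or equal to searchKey,
               * or -1 if every key is greater. That it uses binary search internally is
               * an implementation detail; the name states what the caller gets back.
               */
              static int floorIndexOf(List<ByteBuffer> sortedKeys, ByteBuffer searchKey)
              {
                  int low = 0, high = sortedKeys.size() - 1, floor = -1;
                  while (low <= high)
                  {
                      int mid = (low + high) >>> 1;
                      if (sortedKeys.get(mid).compareTo(searchKey) <= 0)
                      {
                          floor = mid;      // candidate floor, keep searching to the right
                          low = mid + 1;
                      }
                      else
                      {
                          high = mid - 1;   // too large, search to the left
                      }
                  }
                  return floor;
              }
          }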
          mkjellman Michael Kjellman added a comment -

          Loic Lambiel Yes, we ran out of disk space before the code fell over. We had some 250GB partitions when we finally ran out of disk space. Waiting on review and comments from Branimir Lambov, and I'm working on the trunk version. I have most of the unit tests passing, although the new RangeTombstoneBounds etc. is proving pretty fragile and giving me a bit of pain.

          Jeff Jirsa as we've discussed i'm 99.9% sure we should go with your changes too.

          llambiel Loic Lambiel added a comment -

          Any update on your ongoing tests Michael Kjellman ?

          jjirsa Jeff Jirsa added a comment -

          For tables with many non-indexed rows, this patch (0f8e28c220fd5af6c7b5dd2d3dab6936c4aa4b6b.patch) will pack non-indexed row entries behind the 4k-aligned birch segments rather than aligning to the boundary for each (nearly empty) entry. That is, rather than aligning AFTER each time we write a new segment/subsegment, we align before we write a new birch tree/segment, and for non-indexed segments/subsegments we skip the align entirely to save space.
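
          A hedged sketch of that alignment change (illustrative names, not the actual patch): align only immediately before writing a Birch tree, and pack offset-only entries densely with no padding at all.

          import java.io.DataOutput;
          import java.io.IOException;

          // Illustrative sketch only, not the actual 0f8e28c patch; names are hypothetical.
          final class AlignedIndexWriter
          {
              static final int PAGE_SIZE = 4096;
              private long position;

              void writeIndexedPartition(DataOutput out, byte[] serializedBirchTree) throws IOException
              {
                  // pad up to the next 4k boundary so the Birch tree/segment starts page-aligned
                  int padding = (int) ((PAGE_SIZE - (position % PAGE_SIZE)) % PAGE_SIZE);
                  out.write(new byte[padding]);
                  out.write(serializedBirchTree);
                  position += padding + serializedBirchTree.length;
              }

              void writeNonIndexedPartition(DataOutput out, long partitionOffset) throws IOException
              {
                  // non-indexed entries (just a partition offset) are packed densely
                  // behind the previous segment: skipping the alignment is the space win
                  out.writeLong(partitionOffset);
                  position += Long.BYTES;
              }
          }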

          I've not tested the performance impact of this patch, but in terms of size on disk, it creates an -Index component approximately 98% smaller than the birch/9754 patch for an example table where the mean partition size is ~5k.

          Thoughts on this, Michael Kjellman ?

          mkjellman Michael Kjellman added a comment -

          The large partitions hit 158GB today. Latencies are still stable and unchanged. Compaction backed up a little at one point but has fully caught up (with no change in actual write/read tps). I'm working very hard on a post 8099 version of the patch for trunk/3.0.x.

          mkjellman Michael Kjellman added a comment -

          Great. I'm working on a trunk based version now. 8099 is really fun!

          xedin Pavel Yaskevich added a comment -

          I'm planning to take a closer look at the code etc. soon, so if I see something or have any ideas I'll let you know!

          mkjellman Michael Kjellman added a comment -

          Sure, let me change it now.

          Also, if you have any input on the overall way I'm testing and generating load please let me know – I really did try to make it as realistic as I could, and we discussed it internally over here, but I'm all ears if you have a different kind of load in mind that I'm missing and that would make it a more accurate test for certain workloads.

          xedin Pavel Yaskevich added a comment -

          Michael Kjellman Maybe "largeuuid1"? Looks like rows there were about ~300KB too, which is reasonable.

          mkjellman Michael Kjellman added a comment -

          Sure, I can change the test right now. Which table specifically are you talking about adding more keys to? It's a single command line parameter and a restart of the perf load. I'll need to bounce the cluster for the key cache change obviously though.

          The control cluster is running 2.1.16 without Birch, which I did on purpose to see how performance compares with Birch vs. without and specifically to make sure there isn't a regression at the low end, which you're rightfully concerned about (as I am/was too).

          mkjellman Michael Kjellman added a comment -

          One idea I've had for a while is that we could switch the current Summary implementation to just be a Birch tree itself with all keys (not sampled). You could then do a lookup into the row index to get the offset to the columns index in what we call the "primary index" today. Then you'd have a tree per row like we have today.

          xedin Pavel Yaskevich added a comment -

          "I'm actually only using the key cache in the current implementation"

          I wanted to mention that, purely from the perspective of looking up a key in the key cache, I've assumed that the index is only going to have key offsets in it, so we are on the same page.

          Branimir Lambov Is there any way you can run this through an automated perf stress test? Since the size of the tree attached to the key is bigger than it was originally, I'm curious what the performance difference is in conditions where rows are just barely big enough to be indexed and there are a lot of keys.

          Michael Kjellman I understand that the test you are running is designed to check what the performance is like relative to the Birch tree itself, but is there any chance you can disable the key cache and generate some more keys (maybe ~100k?) to see how changes to the column index are affecting the read path top-down?

          mkjellman Michael Kjellman added a comment -

          In regards to your second point: I'm actually only using the key cache in the current implementation if a) it's a legacy index that hasn't been upgraded yet (to keep performance for indexed rows the same during upgrades), or b) the row is "non-indexed", i.e. < 64kb, so just the starting offset.

          For Birch-indexed rows, the entries always come from the Birch impl on disk and don't get stored in the key cache at all. Ideally I think it would be great if we could get rid of the key cache altogether! There was some chat about this in the ticket earlier...
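
          A minimal sketch of that lookup policy; the enum and method below are illustrative only, not the actual code in this patch.

          // Illustrative sketch of the decision described above, not the actual 9754 code.
          final class IndexLookupPolicy
          {
              enum Path { KEY_CACHE, BIRCH_TREE }

              /**
               * @param legacyIndex sstable index written before the Birch upgrade
               * @param indexedRow  row large enough (>= 64kb of data) to carry a columns index
               */
              static Path pathFor(boolean legacyIndex, boolean indexedRow)
              {
                  if (legacyIndex)
                      return Path.KEY_CACHE;   // (a) keep pre-upgrade behaviour during upgrades
                  if (!indexedRow)
                      return Path.KEY_CACHE;   // (b) non-indexed: only a starting offset, cheap to cache
                  return Path.BIRCH_TREE;      // Birch-indexed rows always come from the on-disk tree
              }
          }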

          There is the index summary, which has an offset for keys as they are sampled during compaction; that lets you skip to a given starting file offset inside the index for a key, which reduces the problem you're talking about. I don't think the performance of the small-to-medium sized case should be any different with the Birch implementation than with the current implementation, and I'm trying to test that with the workload going on for the test_keyspace.largeuuid1 table.

          The issue with the Birch implementation vs. the current one, though, is going to be the size of the index file on disk due to the segments being aligned on 4kb boundaries. I've talked a bunch about this and thrown some ideas around with people, and I think the best approach would be to check if the previously added row was a non-indexed segment (so just a long for the start of the partition in the index and no tree being built) and then not align the file to a boundary in those cases. The real issue is that I don't know the length ahead of time, so I can't just encode the aligned segments at the end starting at some starting offset and encode relative offsets iteratively during compaction. Any thoughts on this would be really appreciated though...

          mkjellman Michael Kjellman added a comment -

          Here is cfstats from one of the instances.

          Keyspace: test_keyspace
          	Read Count: 114179492
          	Read Latency: 1.6377607135701742 ms.
          	Write Count: 662747473
          	Write Latency: 0.030130128499184786 ms.
          	Pending Flushes: 0
          		Table: largetext1
          		SSTable count: 26
          		SSTables in each level: [0, 3, 7, 8, 8, 0, 0, 0, 0]
          		Space used (live): 434883821719
          		Space used (total): 434883821719
          		Space used by snapshots (total): 0
          		Off heap memory used (total): 67063584
          		SSTable Compression Ratio: 0.7882047641965452
          		Number of keys (estimate): 14
          		Memtable cell count: 58930
          		Memtable data size: 25518748
          		Memtable off heap memory used: 0
          		Memtable switch count: 3416
          		Local read count: 71154231
          		Local read latency: 2.468 ms
          		Local write count: 410631676
          		Local write latency: 0.030 ms
          		Pending flushes: 0
          		Bloom filter false positives: 0
          		Bloom filter false ratio: 0.00000
          		Bloom filter space used: 496
          		Bloom filter off heap memory used: 288
          		Index summary off heap memory used: 1144
          		Compression metadata off heap memory used: 67062152
          		Compacted partition minimum bytes: 20924301
          		Compacted partition maximum bytes: 91830775932
          		Compacted partition mean bytes: 19348020195
          		Average live cells per slice (last five minutes): 0.9998001524322566
          		Maximum live cells per slice (last five minutes): 1.0
          		Average tombstones per slice (last five minutes): 0.0
          		Maximum tombstones per slice (last five minutes): 0.0
          
          		Table: largeuuid1
          		SSTable count: 59
          		SSTables in each level: [1, 10, 48, 0, 0, 0, 0, 0, 0]
          		Space used (live): 9597872057
          		Space used (total): 9597872057
          		Space used by snapshots (total): 0
          		Off heap memory used (total): 3960428
          		SSTable Compression Ratio: 0.2836031289299396
          		Number of keys (estimate): 27603
          		Memtable cell count: 228244
          		Memtable data size: 7874514
          		Memtable off heap memory used: 0
          		Memtable switch count: 521
          		Local read count: 18463741
          		Local read latency: 0.271 ms
          		Local write count: 108570121
          		Local write latency: 0.031 ms
          		Pending flushes: 0
          		Bloom filter false positives: 0
          		Bloom filter false ratio: 0.00000
          		Bloom filter space used: 22008
          		Bloom filter off heap memory used: 21536
          		Index summary off heap memory used: 11308
          		Compression metadata off heap memory used: 3927584
          		Compacted partition minimum bytes: 42511
          		Compacted partition maximum bytes: 4866323
          		Compacted partition mean bytes: 1290148
          		Average live cells per slice (last five minutes): 0.9992537806937392
          		Maximum live cells per slice (last five minutes): 1.0
          		Average tombstones per slice (last five minutes): 0.0
          		Maximum tombstones per slice (last five minutes): 0.0
          
          		Table: timeuuid1
          		SSTable count: 7
          		SSTables in each level: [0, 1, 3, 3, 0, 0, 0, 0, 0]
          		Space used (live): 103161816378
          		Space used (total): 103161816378
          		Space used by snapshots (total): 0
          		Off heap memory used (total): 13820716
          		SSTable Compression Ratio: 0.9105016396444802
          		Number of keys (estimate): 6
          		Memtable cell count: 150596
          		Memtable data size: 41378801
          		Memtable off heap memory used: 0
          		Memtable switch count: 1117
          		Local read count: 24561527
          		Local read latency: 0.264 ms
          		Local write count: 143545778
          		Local write latency: 0.033 ms
          		Pending flushes: 0
          		Bloom filter false positives: 0
          		Bloom filter false ratio: 0.00000
          		Bloom filter space used: 128
          		Bloom filter off heap memory used: 72
          		Index summary off heap memory used: 308
          		Compression metadata off heap memory used: 13820336
          		Compacted partition minimum bytes: 25109161
          		Compacted partition maximum bytes: 76525646610
          		Compacted partition mean bytes: 13722083374
          		Average live cells per slice (last five minutes): 0.9993586310818542
          		Maximum live cells per slice (last five minutes): 1.0
          		Average tombstones per slice (last five minutes): 0.0
          		Maximum tombstones per slice (last five minutes): 0.0
          
          xedin Pavel Yaskevich added a comment -

          Michael Kjellman This looks great! Can you please post information regarding SSTable sizes and their estimated key counts as well? AFAIR there is another problem related to how indexes are currently stored: if a key is not in the key cache there is no way to jump to it directly in the index file; the index reader has to scan through the index segment to find the requested key. So I'm wondering what happens in the situation where there are many small-to-medium sized keys, e.g. 64-128 MB each, in a given SSTable (let's say the SSTable size is set to 1G or 2G) and stress readers are trying to read random keys: what would be the difference between the current index read performance vs. index + Birch tree?...

          mkjellman Michael Kjellman added a comment -

          Not a "stupid" question at all! There is certainly a bit more overhead here than what we did before, however, I'm closely monitoring compaction in these tests and Pending Tasks isn't backing up so at this read/write load it seems like the additional work is negligible in real world terms.

          mkjellman Michael Kjellman added a comment -

          Test clusters have crossed 110GB for the large CQL partitions!!! Latency is still stable.

          doanduyhai DOAN DuyHai added a comment -

          Stupid question: how do those improvements affect compaction? Did you also monitor the compaction time during your benchmark tests and compare the time taken by each impl?

          mkjellman Michael Kjellman added a comment -

          All of the threads that were responsible for generating load in the control cluster for the two large partition read and write workloads had died because the cluster became so unstable. As soon as I just restarted the stress load 60% of the instances in the cluster OOMed within 2 minutes of restarting the load.

          At this point I don't think I can drive any more data into the partitions with the old code, so I'm going to declare defeat and say that 17GB is the absolute max partition size possible with the old/previous/current index implementation (given the JVM parameters as I detailed in the test description above).

          I'm going to leave the load at the current read and write rates in the two Birch clusters until things explode to see the theoretical max partition size possible with the Birch implementation today. After that I'll wipe the clusters and restart the same load at 2x the read and write rates to see how things go with that configuration.

          mkjellman Michael Kjellman added a comment -

          I have even more great news! My two test clusters just crossed 53GB for the 90th percentile of max row size. The 50th percentile of mean row size is ~8.5GB. Read and write latencies are still the same as the numbers I posted above from 3 days ago. So you could have an entire cluster of 10GB rows and still be stable.

          mkjellman Michael Kjellman added a comment -

          Morning update: the stress load has continued in all partitions since the last update. The large partitions have grown to ~21GB. Latencies are still unchanged for both reads and writes in all percentiles!! Onwards to the next milestone, 50GB! I also doubled the read and write load around 10 hours ago to 4k reads/sec and 10k writes/sec to grow the partitions faster.

          mkjellman Michael Kjellman added a comment - - edited

          Attaching an initial set of very rough graphs showing the last 12 hours of stress/performance testing that's been running. I apologize ahead of time for some of the graphs – I wanted to include the average, p99.9th, and count for all key metrics, and in some cases some of the values overlapped and my graphing foo wasn't good enough to improve the readability. I'll take another pass when I get some time with the next round of performance testing. The "large" CQL partitions in all 3 clusters are currently (and were during the duration of the test) between ~6GB and ~12.5GB, although I'm planning on running the stress/performance tests in all 3 clusters until the "large" CQL partitions hit ~50GB. The load was started in all 3 clusters (where all 3 were totally empty at start) at the same time, from the same stress tool code that I wrote specifically to realistically test Birch after repeated attempts to generate a good workload with cassandra-stress made me give up on it. Some details of the stress tool and the load that was being generated for these graphs are below.

          There are three read-write workloads being run to generate the load during these tests.

          I wrote the following two methods for my "simple-cassandra-stress" tool I threw together to generate keys that the worker-threads operate on. I'll refer to them below in terms of how the stress load is currently being generated.

          import java.nio.charset.Charset;
          import java.security.InvalidParameterException;
          import java.util.ArrayList;
          import java.util.Arrays;
          import java.util.List;
          import java.util.Set;

          import com.datastax.driver.core.Cluster;
          import com.datastax.driver.core.TokenRange;
          import com.google.common.hash.HashCode;
          import com.google.common.hash.HashFunction;
          import com.google.common.hash.Hashing;

          // Generates `number` random partition keys (murmur3 hashes of small random longs);
          // RANDOM_THREAD_LOCAL is presumably a ThreadLocal<Random> defined elsewhere in the tool.
          public static List<HashCode> generateRandomKeys(int number) {
              List<HashCode> keysToOperateOn = new ArrayList<>();
              HashFunction hf = Hashing.murmur3_128();
              for (int i = 0; i < number; i++) {
                  HashCode hashedKey = hf.newHasher().putLong(RANDOM_THREAD_LOCAL.get().nextInt(300000) + 1).hash();
                  keysToOperateOn.add(hashedKey);
              }
              return keysToOperateOn;
          }
          
          // Generates `number` keys spread evenly across the cluster's token ranges so the
          // load lands uniformly on all instances; the same seed reproduces the same keys.
          public static List<HashCode> generateEvenlySpacedPredictableKeys(int number, int offset,
                                                                           String seed, Cluster cluster) throws InvalidParameterException {
              Set<TokenRange> tokenRanges = cluster.getMetadata().getTokenRanges();
              int numberOfKeysToGenerate = (number < tokenRanges.size()) ? tokenRanges.size() : number;
          
              Long[] tokens = new Long[numberOfKeysToGenerate];
          
              int pos = 0;
          
              int numberOfSplits = (number <= tokenRanges.size()) ? 1 : (number / tokenRanges.size()) + 1;
              for (TokenRange tokenRange : tokenRanges) {
                  for (TokenRange splitTokenRange : tokenRange.splitEvenly(numberOfSplits)) {
                      if (pos >= tokens.length)
                          break;
          
                      tokens[pos++] = (Long) splitTokenRange.getStart().getValue();
                  }
          
                  if (pos >= tokens.length)
                      break;
              }
          
              HashCode[] randomKeys = new HashCode[tokens.length];
              int pendingRandomKeys = tokens.length;
              while (pendingRandomKeys > 0) {
                  for (int i = offset; i < (offset + numberOfKeysToGenerate) * (number * 10); i++) {
                      if (pendingRandomKeys <= 0)
                          break;
          
                      HashFunction hf = Hashing.murmur3_128();
                      HashCode hashedKey = hf.newHasher().putString(seed, Charset.defaultCharset()).putInt(i).hash();
          
                      for (int t = 0; t < tokens.length; t++) {
                          if ((t + 1 == tokens.length && hashedKey.asLong() >= tokens[t]) || (hashedKey.asLong() >= tokens[t] && hashedKey.asLong() < tokens[t + 1])) {
                              if (randomKeys[t] == null) {
                                  randomKeys[t] = hashedKey;
                                  pendingRandomKeys--;
                              }
          
                              break;
                          }
                      }
                  }
              }
          
              return Arrays.asList(randomKeys);
          }
          

          There are 12 Cassandra instances in each performance/stress cluster running JDK 1.8_u74 with the CMS collector (obviously simplified) running with -Xms5G -Xmx5G -Xmn1G.

          The test keyspace is created with RF=3:

          CREATE KEYSPACE IF NOT EXISTS test_keyspace WITH replication = {'class': 'NetworkTopologyStrategy', 'datacenter1': 3}
          

          Operations for test_keyspace.largeuuid1 generate a new key to insert and read from at the top of every iteration with generateRandomKeys(1). Each worker then generates 10,000 random mutations, with the current timeuuid and a random value blob of 30 bytes to 2kb. This is intended to get some more "normal" load on the cluster.

          CREATE TABLE IF NOT EXISTS test_keyspace.timeuuid1 (name text, col1 timeuuid, value blob, primary key(name, col1)) WITH compaction = { 'class':'LeveledCompactionStrategy' }
          
          "INSERT INTO test_keyspace.largeuuid1 (name, col1, value) VALUES (?, ?, ?)"
          "SELECT * FROM test_keyspace.largeuuid1 WHERE name = ? and col1 = ?"
          

          The second and third generated workloads attempt to stress the large row size element of this work. The goal here is to create infinitely growing partitions. test_keyspace.largetext1 and test_keyspace.largeuuid1 are largely the same, except that test_keyspace.largetext1 is intended to also stress the Overflow logic for large composite keys. A key design element of Birch is its support for variable length keys. Cassandra supports row keys up to a maximum length of unsigned short. To have predictable performance in the tree implementation, however, supporting keys of length unsigned short as first-class citizens would adversely hurt the performance of the 99.999% of other normal sized keys. To support these large keys (but not hurt the performance of normal sized keys) a Birch node/leaf will contain up to ((size_of_leaf_node / 2) / 2) bytes of a key, where size_of_leaf_node is 4kb by default and we divide by 2 to accommodate serializing/inserting at least 2 elements in a single node. This results in a key of length <= 1kb being supported without any special handling, which should cover the use cases of almost everyone in the world.

          For keys that exceed that length, the rest of the bytes are written into a single Overflow page which is shared between all inner + leaf nodes and is not page aligned. This means we will keep 1kb worth of the key (assuming a 4kb Birch node size) inside the node itself and the rest in the Overflow page. If we need to read that key we can grab the bytes from the node + overflow page inline during the tree operation and re-assemble the entire variable key. This has a slight performance cost (of course) as it requires the allocation of an additional byte[], an additional seek, and additional reads.
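
          A small sketch of the inline/overflow arithmetic described above, assuming the default 4kb node size; the names here are illustrative, not the actual Birch serialization code.

          // Illustrative only; assumes the default 4kb node size described above.
          final class KeySplitEstimate
          {
              static final int NODE_SIZE = 4 * 1024;
              // keep at most (node_size / 2) / 2 = 1kb of any key inline, so that at
              // least two entries always fit inside a single node
              static final int MAX_INLINE_KEY_BYTES = (NODE_SIZE / 2) / 2;

              /** Bytes of this key that spill into the shared, non-aligned Overflow page. */
              static int overflowBytes(int keyLength)
              {
                  return Math.max(0, keyLength - MAX_INLINE_KEY_BYTES);
              }

              public static void main(String[] args)
              {
                  System.out.println(overflowBytes(36));       // UUID-sized key: 0 (fully inline)
                  System.out.println(overflowBytes(3 * 1024)); // 3kb key: 2048 bytes in the Overflow page
              }
          }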

          To exercise this, col1 in test_keyspace.largetext1 is a randomly generated string from 300 bytes to 4kb – and conversely, to see the performance without the Overflow logic (which will almost always be the case in real life, as row keys > 1kb are pretty ridiculous) test_keyspace.largeuuid1 uses a simple randomly generated UUID for its primary key.

          generateEvenlySpacedPredictableKeys() (see above) was written to generate a predictable set of pseudo-random keys (where the same seed will generate the same "random" keys). The logic is a bit complicated, as I found that just randomly generating n keys didn't guarantee the load would be evenly distributed across the ring, and a disproportionate number of the randomly generated keys would land on a few instances. The goal here is to generate an even number of keys that can be re-used even between launches of the stress tool itself, to generate "infinitely" wide/large CQL partitions!

          CREATE TABLE IF NOT EXISTS test_keyspace.largetext1 (name text, col1 text, value blob, primary key(name, col1)) WITH compaction = { 'class':'LeveledCompactionStrategy' }
          CREATE TABLE IF NOT EXISTS test_keyspace.largeuuid1 (name text, col1 uuid, value uuid, primary key(name, col1)) WITH compaction = { 'class':'LeveledCompactionStrategy' }
          
          "INSERT INTO test_keyspace.timeuuid1 (name, col1, value) VALUES (?, ?, ?)"
          "SELECT * FROM test_keyspace.timeuuid1 WHERE name = ? and col1 = ?"
          
          "INSERT INTO test_keyspace.largetext1 (name, col1, value) VALUES (?, ?, ?)"
          "SELECT * FROM test_keyspace.largetext1 WHERE name = ? and col1 = ?"
          

          The values that are generated for insert are generated lazily, to allow us to insert large amounts of data without incurring impossible memory and CPU costs on the client/stress-tool side by attempting to generate them all up front (which is what attempting to configure a large partition with cassandra-stress will do, and fail at). I then sample the randomly generated values per iterator at a given rate so that, once I'm done inserting enough data to make a best effort at ensuring the memtable has been flushed (and so the read will come from disk, not the memtable), I can iterate through the samples, select those values, and validate that the database is returning the same thing I know I inserted (to ensure replacing such a critical part of Cassandra's storage engine hasn't broken correctness – which is a paramount requirement above everything, obviously).
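
          A hedged sketch of the lazy-generation-plus-sampling idea (illustrative only, not the actual simple-cassandra-stress code): values are produced on demand, and every Nth value is remembered so it can be read back and verified after the memtable has flushed.

          import java.nio.ByteBuffer;
          import java.util.ArrayList;
          import java.util.List;
          import java.util.concurrent.ThreadLocalRandom;

          // Illustrative sketch only, not the actual stress tool implementation.
          final class SamplingValueStream
          {
              private final int sampleEvery;
              private final List<ByteBuffer> samples = new ArrayList<>();
              private long generated;

              SamplingValueStream(int sampleEvery) { this.sampleEvery = sampleEvery; }

              /** Generates the next value blob (30 bytes to 2kb) lazily rather than up front. */
              ByteBuffer next()
              {
                  byte[] value = new byte[ThreadLocalRandom.current().nextInt(30, 2 * 1024 + 1)];
                  ThreadLocalRandom.current().nextBytes(value);
                  ByteBuffer buffer = ByteBuffer.wrap(value);
                  if (generated++ % sampleEvery == 0)
                      samples.add(buffer.duplicate());   // keep a few values to read back and verify
                  return buffer;
              }

              /** Values to SELECT after flushing, to check the server returns exactly what was inserted. */
              List<ByteBuffer> sampledValues() { return samples; }
          }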

          Now, Some Graphs!

          It's very easy to see the difference between the Birch and non-Birch (control) clusters. With Birch the read and write latencies are consistent, regardless of the size of the CQL partitions that are being written and read from. GC counts are very low, and when GC does run it consists of very short ParNew runs, not long STW CMS collections.

          In comparison, the control cluster without Birch shows an upward trend in latencies as the CQL partition size continues to grow. GC is very unpredictable, with many regular (and long) 200-300ms STW CMS pauses. Instances were also starting to frequently OOM while I was collecting statistics. This makes it hard to get good comparison data, as the latencies and counts that the cluster can drive aren't predictable at all between instances restarting and randomly pausing for very extended lengths of time.

          p.s. Jira sucks and the "thumbnail" image option doesn't work and a quick Google seems to indicate it's a well known broken feature. Additionally, if you resize the images to display inline there is no way to link to the full image. So sorry – I tried my best with what tools I have lol.

          Read, Write Counts and Latencies, and Java JVM GC Statistics with Birch

             
          perf_cluster_1_with_birch_read_latency_and_counts.png
          perf_cluster_1_with_birch_write_latency_and_counts.png
          perf_cluster_2_with_birch_read_latency_and_counts.png
          perf_cluster_2_with_birch_write_latency_and_counts.png
          gc_collection_times_with_birch.png
          gc_counts_with_birch.png

          Read, Write Counts and Latencies, and Java JVM GC Statistics without Birch

             
          perf_cluster_3_without_birch_read_latency_and_counts.png
          perf_cluster_3_without_birch_write_latency_and_counts.png
          gc_collection_times_without_birch.png
          gc_counts_without_birch.png
          mkjellman Michael Kjellman added a comment -

          There were some issues with my cherry-pick to my public GitHub branch. I started over and squashed all 182 individual commits from scratch, rebased up to 2.1.16, and pushed to a new branch: https://github.com/mkjellman/cassandra/tree/CASSANDRA-9754-2.1-v2

          The full squashed 2.1 based patch is https://github.com/mkjellman/cassandra/commit/b17f2c1317326fac7b6864a2fc61d7ee2580f740

          Hide
          mkjellman Michael Kjellman added a comment -

          Fixed: a single squashed commit for the 182 individual commits (wow, didn't realize it was that many) has just been pushed to a new branch, rebased up to 2.1.16: https://github.com/mkjellman/cassandra/tree/CASSANDRA-9754-2.1-v2

          Hide
          mkjellman Michael Kjellman added a comment -

          Yes, however something went wrong with the cherry-pick to the external github.com repo as caught by Jeff. I'm squashing all the changes now into a single commit and pushing a new branch up. Give me a few more moments.

          Hide
          blambov Branimir Lambov added a comment -

          Is it now ready for review?

          Hide
          mkjellman Michael Kjellman added a comment -

          Latest set of fixes pushed to https://github.com/mkjellman/cassandra/commit/5586be24f55a16887376cb244a7d1b1fa777927f
          Hide
          mkjellman Michael Kjellman added a comment -

          Stable all night! My large test partitions have grown to ~12.5GB and it's just as stable – latencies are unchanged. I'm so happy!!! ~7ms avg p99.9th and ~925 microseconds average read latency. GC is basically non-existent – and for what GC is happening, the instances are averaging a 111 microsecond ParNew collection – almost NO CMS! Compaction is keeping up.

          On the converse side, the control 2.1 cluster running the same load has instances OOMing left and right – CMS is frequently running 250 ms collections, and ParNew is running 1.28 times a second on average with 75 ms average ParNew times. Horrible! And that's the average – the upper percentiles are a mess, so I won't bore everyone. Read latencies are currently 380 ms on average, with many 15 second read latencies in the p99.9.

          Hide
          mkjellman Michael Kjellman added a comment -

          I fixed the last (pretty nasty) bug today/tonight! The issue was in IndexedSliceReader#IndexedBlockFetcher, where I was failing to properly initialize a new iterator at the given start of the slice for the read query. This caused every read to iterate over all indexed entries every time. Fortunately, that bug had brought some performance concerns about the underlying read logic to my attention, which I also addressed thinking they were the root cause.

          I'm currently running my performance/stress load in three separate performance clusters; two with a build that has Birch and one that is a control version of 2.1.16. I'm currently performing 700 reads per/sec per instance and 1.5k writes per/sec.

          Read latencies in both Birch perf clusters are showing (at the storage proxy level) 838 microseconds on average and only 7.4 milliseconds at the p99.9th!
          Write latencies in both Birch perf clusters are showing (at the storage proxy level) 138 microseconds on average and 775 microseconds at the p99.9th!

          There is basically no GC to be spoken for and the latencies are very stable (and have been) for the past hour since I restarted the load with the fix for the Iterator as mentioned above.

          The best thing about all these stats above is that many of the reads are occurring against (currently) 8.5GB rows! The control cluster has latencies 7-8x those of the Birch clusters so far, GC is out of control, and instances are starting to constantly OOM. It's hard to compare anything against the control cluster, as things start to fall apart very significantly after the test CQL partitions grow above ~4GB... eek.

          I'm going to let the load continue overnight to grow the partitions larger (I'm targeting 50GB for this first performance milestone).

          It's pretty hard not to be happy when you see these numbers. This could end up being very, very epic for our little project. I'm pretty, pretty, pretty (okay, *really*) happy tonight!!

          Hide
          mkjellman Michael Kjellman added a comment -

          I just pushed up a squashed commit for the roughly 68 individual commits I've made while working on stability and performance over the past few weeks.

          https://github.com/mkjellman/cassandra/commit/41c6d43d0b020149a5564d4f7ab3c92e1bfcba64

          I'm currently writing up the findings from the latest stress test I've been running for the last 24 hours across 3 performance clusters and will update the ticket with that in a bit.

          Hide
          mkjellman Michael Kjellman added a comment -

          I'm a bit late on the Tuesday target I was aiming for on Saturday, but for good reason – I've been working almost non-stop since (went to bed at 3:30am and was up at 8:30am looking at graphs... and I've been looking at graphs ever since). I have a performance load running in 3 perf clusters – I'd like to aggregate those objective findings tomorrow and then push up the current state of things (it's very stable, so I'm pretty pumped about that) along with some benchmarks (the good and the possibly bad/still-needs-improvement).

          Hide
          blambov Branimir Lambov added a comment -

          if we mmap a few times we'll still incur the very high and unpredictable costs from mmap

          The MmappedRegions usage is to map the regions at sstable load, i.e. effectively only once in the table's lifecycle, which should completely avoid any mmap costs at read time.

          I'm wondering though if mmap'ing things even makes sense

          That depends on whether we want to squeeze out the last bit of performance or not. Memory-mapped data (assuming it is already mapped as above) that resides in the page cache has no cost whatsoever to access, while reading it off a RAF or a channel still needs a system call plus some copying. The difference is felt most on workloads that fit entirely in the page cache.

          If you don't feel like this is helpful, you can leave this out of the 2.1 version and rely on Rebufferer (or RandomAccessReader) to do memmapping or caching for you in trunk.

          Hide
          mkjellman Michael Kjellman added a comment -

          I saw that way back when I started implementing things – I'm wondering, though, if mmap'ing things even makes sense. Given the chunks of work are aligned on 4k boundaries, even if we mmap a few times we'll still incur the very high and unpredictable costs from mmap (even if less often, given 2GB chunks are obviously much bigger than 4kb)... thoughts? I'm trying to profile it now...

          Hide
          blambov Branimir Lambov added a comment -

          > Originally, I was mmapping 4kb aligned chunks as necessary.

          Cassandra has some machinery to deal with the same problem in RandomAccessReader; the solution we have in place is to map the entire file in <2GB chunks and look the chunk up on a read. Take a look at MmappedRegions in trunk and its users.
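          To make the suggestion concrete, here is a rough sketch (illustrative names and chunk size, not the actual MmappedRegions code) of mapping a file once in sub-2GB chunks at load time and keeping the read path down to pure arithmetic on already-mapped buffers:

{code:java}
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.ByteBuffer;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;

// Sketch: map the whole file in fixed-size chunks once, then serve reads from the
// already-mapped buffers. Reads that straddle a chunk boundary are not handled here;
// real code has to deal with that case as well.
public final class ChunkedMmap
{
    private static final long CHUNK_SIZE = 1L << 30; // 1GB, safely under the 2GB MappedByteBuffer limit

    private final MappedByteBuffer[] chunks;

    public ChunkedMmap(RandomAccessFile file) throws IOException
    {
        long length = file.length();
        int count = (int) ((length + CHUNK_SIZE - 1) / CHUNK_SIZE);
        chunks = new MappedByteBuffer[count];
        FileChannel channel = file.getChannel();
        for (int i = 0; i < count; i++)
        {
            long start = i * CHUNK_SIZE;
            long size = Math.min(CHUNK_SIZE, length - start);
            // the mmap syscall happens here, once per chunk at load time, never on the read path
            chunks[i] = channel.map(FileChannel.MapMode.READ_ONLY, start, size);
        }
    }

    // Read path: chunk lookup is pure arithmetic plus a duplicate() of a mapped buffer.
    public ByteBuffer view(long position, int length)
    {
        int chunk = (int) (position / CHUNK_SIZE);
        int offset = (int) (position % CHUNK_SIZE);
        ByteBuffer view = chunks[chunk].duplicate();
        view.position(offset);
        view.limit(offset + length);
        return view.slice();
    }
}
{code}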

          Hide
          mkjellman Michael Kjellman added a comment -

          Oh, another very important update. Originally, I was mmapping 4kb aligned chunks as necessary. When I finally got things stable (after a few file descriptor leaks and some fun fighting Java over MappedByteBuffer objects), I ran the performance load from the stress tool I wrote and found the performance was randomly terrible (like 1.3 SECONDS in the 99.9th percentile). Upon investigation and a ton of instrumentation, I found mmap calls were taking 90+ms in the 99th percentile and 70+ms in the 90th percentile on the hardware I'm using for performance testing. I looked into the JDK source code to figure out if there were any synchronized blocks in the native code, but it's pretty sane and just calls the mmap syscall. I discussed it a bit with Norman Maurer and we both came away pretty shocked that mmap could be that slow! These boxes have 256GB of RAM and there was basically zero disk IO, as everything was in the page cache as expected. There were a lot of major page faults, but it's really very surprising that mmap can be so horrible in the upper percentiles.

          I ripped out all the mmap logic on the read path and switched to reading the aligned 4kb chunks directly from the RAF as needed, and everything looked amazing.
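          For context, the replacement read path boils down to something like this rough sketch (illustrative only, not the actual PageAlignedReader code): a positional read of one 4kb-aligned node straight from the file channel, with no mmap involved.

{code:java}
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;

// Sketch only: read a single page-aligned 4kb node with a positional read.
// Buffer pooling/reuse is omitted for brevity.
static ByteBuffer readAlignedNode(FileChannel channel, long nodeOffset) throws IOException
{
    final int PAGE_SIZE = 4096;
    assert nodeOffset % PAGE_SIZE == 0 : "node offsets are expected to be page aligned";
    ByteBuffer node = ByteBuffer.allocate(PAGE_SIZE);
    while (node.hasRemaining())
    {
        int read = channel.read(node, nodeOffset + node.position());
        if (read < 0)
            break; // EOF; a real reader would treat a truncated node as corruption
    }
    node.flip();
    return node;
}
{code}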

          Hide
          mkjellman Michael Kjellman added a comment -

          Wanted to post a quick update on the ticket. I've been working pretty much around the clock for the last two weeks on stabilizing, performance testing, validating, and bug fixing the code. I had an unfortunate unexpected death in my family last week so I lost the better part of this past week tying up the last pieces I was finishing up before I got the bad news.

          After attempting to work with a few people in the community to get cassandra-stress working in a way that actually stresses large partitions and validates the data written into it, I ended up needing to write a stress tool. I loaded up a few hundred 30GB+ partitions with column sizes of 300-2048 bytes while constantly reading data that was sampled during the inserts to make sure I'm not returning bad data or incorrect results.
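          The validation side of the tool is conceptually simple; a driver-agnostic sketch of the idea (class and method names here are invented for illustration, not the actual tool) looks roughly like this:

{code:java}
import java.nio.ByteBuffer;
import java.util.ArrayList;
import java.util.List;
import java.util.Objects;
import java.util.function.BiFunction;

// Sketch: sample every Nth write, then replay the samples through a supplied reader
// and fail loudly on any mismatch.
final class SampledValidator
{
    private static final int SAMPLE_EVERY = 1000;

    private final List<Object[]> samples = new ArrayList<>(); // {partitionKey, clusteringKey, value}
    private long writes = 0;

    void recordWrite(String partitionKey, String clusteringKey, ByteBuffer value)
    {
        if (writes++ % SAMPLE_EVERY == 0)
            samples.add(new Object[]{ partitionKey, clusteringKey, value.duplicate() });
    }

    // reader: (partitionKey, clusteringKey) -> value as returned by the database
    void validate(BiFunction<String, String, ByteBuffer> reader)
    {
        for (Object[] sample : samples)
        {
            ByteBuffer expected = (ByteBuffer) sample[2];
            ByteBuffer actual = reader.apply((String) sample[0], (String) sample[1]);
            if (!Objects.equals(expected, actual))
                throw new AssertionError("validation failed for " + sample[0] + '/' + sample[1]);
        }
    }
}
{code}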

          I ran the most recent load for ~2 days in a small performance cluster and there were no validation errors. Additionally, I'm running the exact same stress/perf load in another identical cluster with a 2.1 build that does not contain Birch. This is allowing me to make objective A/B comparisons between the two builds.

          The build is stable, there are no exceptions or errors in the logs even under pretty high load (the instances are doing 3x the load we generally run at in production), and most importantly GC is very stable. Without Birch, GC starts off great, but around the time the large partitions generated by the stress tool reached ~250MB, GC shot up and then kept increasing as the rows grew (as expected). In contrast, the cluster with the Birch build had no change in GC as the size of the partitions increased.

          I was a bit disappointed with some of the read latencies I saw in the upper percentiles, so I've identified what I'm almost positive was the cause and just finished refactoring the logic for serializing/deserializing the aligned segments and subsegments in PageAlignedWriter/PageAlignedReader.

          I'm cleaning up the commit now and then going to get it into the perf cluster to start another load. If that looks good, I'm hoping to push all the stability and performance changes I've made up to my public GitHub branch, most likely Tuesday, as I'd like to let the performance load run for 2 days to build up large enough partitions to accurately stress and test things.

          Hide
          mkjellman Michael Kjellman added a comment -

          Over the past few days I've made some really great progress. I have the 2.1 based implementation (as found at https://github.com/mkjellman/cassandra/commits/CASSANDRA-9754-2.1) in a temporary performance cluster running stably against cassandra-stress.

          I found a few issues that I've been fixing as I find them while running the code under load:

          • Fix reading of non-birch indexes from SSTableScanner
          • Force un-mmapping of the current mmapped buffer from a PageAlignedReader before mmapping a new region
          • Fix alignTo() issues when using anything other than 4096 padding for indexes (e.g. 2048)
          • Make Birch/PageAligned Format padding length configurable (sstable_index_segment_padding_in_kb)
          • Fix signing issue when serializing and deserializing an unsigned short
          • Use a reusable buffer in PageAlignedWriter
          • Fix an issue where the index of the current subsegment was being used when the index of the current segment should have been used
          • Other minor cleanup, spelling nits, etc

          I've observed a bug where a java.nio.BufferUnderflowException is sometimes thrown under load from a ValidationExecutor thread while doing a repair. I've put some temporary logging in to dump the state of the reader when the exception happens but I'm still not sure how it gets into that state. Wondering if there is some kind of concurrency problem somewhere?

          Also (although obvious in hindsight), the page alignment that keeps segments aligned on 4kb boundaries causes unacceptable write amplification in the size of the index file for workloads with small row keys and < 64kb of data in the row (a.k.a. no index). I've been discussing the various options we have, and the tradeoffs of each, with a few people. Hoping to formalize those thoughts and implement something today or tomorrow.

          So, all in all, the 2.1-based implementation is really stabilizing, and initial performance tests are looking very encouraging!

          Hide
          mkjellman Michael Kjellman added a comment - - edited

          I've discovered a performance regression caused by the original logic in PageAlignedReader. I always knew the original design wasn't ideal; however, I felt that the additional code complexity wasn't worth the performance improvements. Now that the code has stabilized and I've moved on to performance validation (and not just bugs and implementation), I found it was horribly inefficient.

          https://github.com/mkjellman/cassandra/commit/b4b3152ec7d92d85c032cfbcbfae705e9dc36989

          I've updated the documentation in PageAlignedWriter to cover the new PageAligned file format. The new implementation allows lazy deserialization of segment metadata as required, and enables binary search across segments via the fixed-length starting offsets. This means deserialization of the segments is no longer required ahead of time – segment metadata is only deserialized when required to return a result.
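          To illustrate the idea (the real layout is documented in PageAlignedWriter; the field layout below is only an assumption for the sketch): because each segment's starting offset is a fixed-width entry, a reader can binary search the offset table with positional reads and only then deserialize the one segment it actually needs.

{code:java}
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;

// Sketch: assumed layout [segmentCount:int][startOffset_0:long]...[startOffset_n-1:long][segment data...]
// Only 8-byte offsets are read while searching; a segment's metadata is deserialized
// lazily, only once its index has been found.
static int findSegmentIndex(FileChannel channel, long targetOffset) throws IOException
{
    ByteBuffer buf = ByteBuffer.allocate(8);
    buf.limit(4);
    channel.read(buf, 0);
    buf.flip();
    int segmentCount = buf.getInt();

    int low = 0, high = segmentCount - 1, result = 0;
    while (low <= high)
    {
        int mid = (low + high) >>> 1;
        buf.clear();
        channel.read(buf, 4 + 8L * mid); // fixed-width entries make the position pure arithmetic
        buf.flip();
        long segmentStart = buf.getLong();
        if (segmentStart <= targetOffset)
        {
            result = mid;      // last segment starting at or before the target wins
            low = mid + 1;
        }
        else
        {
            high = mid - 1;
        }
    }
    return result;
}
{code}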

          Initial benchmarking and profiling makes me a pretty happy guy. I think the new design is a massive improvement over the old one and looks pretty good so far.

          Hide
          mkjellman Michael Kjellman added a comment - - edited

          I pushed a rebased commit that addresses many additional review comments from Jason Brown, adds additional unit tests, and has many further improvements to the documentation. This is still 2.1 based; however, the review and the improvements made in the org.apache.cassandra.db.index.birch package are agnostic to a trunk or 2.1 based patch.

          https://github.com/mkjellman/cassandra/tree/CASSANDRA-9754-2.1

          Some Highlights:

          • Fix a bug in KeyIterator identified by Jeff Jirsa that would cause the iterator to return nothing when the backing SegmentedFile contains exactly 1 key/segment.
          • Add unit tests for KeyIterator
          • Add SSTable version ka support to LegacySSTableTest. Actually test something in LegacySSTableTest.
          • Add additional unit tests around PageAlignedReader, PageAlignedWriter, BirchWriter, and BirchReader
          • Remove word lists and refactor all unit tests to use TimeUUIDTreeSerializableIterator instead
          • Improve documentation and fix documentation as required to properly parse and format during javadoc creation
          • Remove reset() functionality from BirchReader.BirchIterator
          • Fix many other nits
          Hide
          mkjellman Michael Kjellman added a comment - - edited

          So, I'm mostly done with a trunk version of the patch, however, I'm currently focusing on finishing and polishing the 2.1 based version. Although the abstraction of the index is almost a total rewrite between 2.1 and trunk, the tree itself (aka the Birch implementation) should remain almost identical so this certainly isn't wasted time for anyone. I've cleaned up the implementation a bunch, taken care of a bunch of todos and low hanging fruit, added more documentation, and pushed it to Github to make it a bit easier to make sure the changes apply cleanly.

          https://github.com/mkjellman/cassandra/commit/e5389378b19eb03de7dd4d50d6df110c68057985

          The following 4 unit tests (out of 1184) are still failing (so close!):

          • org.apache.cassandra.cql3.KeyCacheCqlTest (2 of 2). Need to talk to Ariel Weisberg to understand exactly what these unit tests are testing.
          • org.apache.cassandra.db.ColumnFamilyStoreTest (2 of 38, both related to secondary indexes)

          Tomorrow, I hope to push a patch addressing the feedback from Branimir Lambov (see above comment), along with any changes that come out of the code review currently underway by Jason Brown and sankalp kohli. I also need/want to do some work on getting more comfortable with the upgrade/backwards-compatibility story and make sure there is a good unit test story around that.

          Jeff Jirsa, if you get a chance to take a look, please let me know if you have any initial feedback – that would be awesome!

          Hide
          mkjellman Michael Kjellman added a comment -

          Thanks Branimir for starting to review the code!!

          1) Yes, good optimization. The first-most entry in the tree (i.e. the first element in the root node) doesn't need to be repeated in the inner nodes, as you can assume that you always go to the left.
          2) I think we can make the assumption that if the length of the bytes for a given entry's key is greater than or equal to the max length for a given node/leaf, the code can assume there will be an int encoded after the key bytes holding the offset into the overflow page. If the key happens to be equal to the max length (but doesn't actually overflow), then we can encode 0 or something and have the code know that value means there is no overflow page. The downside here is adding more assumptions and complexity to the code vs eating the single byte in the overflow page.
          3) We could do this, but I think it increases the complexity, as the root node would need to be special-cased for max size. Currently the root node, inner nodes, and leaf nodes all use the exact same code, where only the value serializer is different (to either write the offset to the next node/leaf or the actual value of the entry you want to add to the tree). We'd need to subtract the size of the descriptor from the available size of the aligned root node and then know where to seek inside it.

          In regards to your final comment: while technically we could build the root/inner nodes to point at offsets of IndexInfo entries serialized in the current format, I think the tradeoffs make it non-ideal. If we serialize the entries in the leaf like I'm currently doing, we a) get the benefits of the leaf being aligned and padded, and b) with the proposed leaf format, we can binary search inside each leaf node to get the exact entry, vs having an inner node just point to an offset and then needing to linearly scan until we hit a match.

          I'm almost done with a refactor for trunk. We found some pretty serious regressions in 2.1 that required my immediate attention but I hope to have a trunk based patch with your initial suggestions from above incorporated into the tree implementation very soon.

          Hide
          blambov Branimir Lambov added a comment -

          I spent some time reading through BirchReader to figure out the nuts and bolts of how the storage works. I think we can squeeze a little more efficiency into the structure:

          • As far as I could see, your current implementation places a lot of copies on the lower side of each span in the non-leaf nodes (for example, the lowest key of the partition is present in the leaf node, its parent, and all parents leading all the way to the root). This should not be necessary: simply omitting the first key (but retaining the child pointer) from all intermediate nodes and adding 1 to what the binary search returns will achieve the same result (see the sketch at the end of this comment).
          • I find the overflow flag (and jumping back and forth to read it) less efficient than necessary. If we assume instead that a key length equal to the max always entails overflow data, we would be using less space and be more efficient in the common case, while having a very low chance of taking a few bytes more in the uncommon situation of long keys.
          • The root node could be in the same page as the descriptor (it is usually smaller, so there is a high chance it fits). Perhaps the overflow is best placed elsewhere?

          More generally (ignoring padding on the leaves which is not necessarily always beneficial), the B+ structure you have built is practically a B-Tree index over a linear list of index entries. As we already have a linear list of IndexInfo structures in the current format, what are we gaining by not just building a B-Tree index over that? To me the latter would appear to be less complicated and much more generic with immediate possible applications in other parts of the codebase.
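          For what it's worth, a minimal sketch of the "omit the first key" idea above (hypothetical names, not the actual BirchReader code): an inner node with k children stores only k-1 separator keys, and the child to descend into falls straight out of the binary search result.

{code:java}
import java.util.Arrays;
import java.util.Comparator;

// Sketch only: descend an inner node whose first key is omitted.
// keys[] holds the k-1 separator keys for k children; children[] holds the k child offsets.
static long childFor(long[] children, byte[][] keys, byte[] searchKey, Comparator<byte[]> cmp)
{
    int idx = Arrays.binarySearch(keys, searchKey, cmp);
    // An exact match at i means the key lives under children[i + 1]; a miss returns
    // (-(insertionPoint) - 1) and we descend into children[insertionPoint] -- the
    // "adjust what the binary search returns" trick described above.
    int child = idx >= 0 ? idx + 1 : -(idx + 1);
    return children[child];
}
{code}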

          Hide
          mkjellman Michael Kjellman added a comment -

          T Jake Luciani will do!

          Hide
          tjake T Jake Luciani added a comment - - edited

          Michael Kjellman for a > 2.1 patch, take a look at CASSANDRA-7443 which added an abstraction for IndexEntry and serializers which should hopefully be similar to what you did for this 2.1 version.

          Hide
          mkjellman Michael Kjellman added a comment -

          Some additional thoughts while I'm thinking about them:

          • PageAlignedReader currently deserializes all the segments in the constructor. It might be more efficient to lazily deserialize each segment as we need it. I'm sure perf testing will quickly make it clear if the extra code complexity is worth the potential performance trade-off...
          • I picked 4kb for the page size based on an educated guess, but obviously other sizes need to be tested (less? more?)
          Hide
          mkjellman Michael Kjellman added a comment -

          TIL: Attempting to upload to Jira via the slow and overpriced Gogo in-flight wifi doesn't work... "Cannot attach file 9754_part2-v1.diff: Unable to communicate with JIRA." Working on it..

          Hide
          jbellis Jonathan Ellis added a comment -

          Delighted to see this patch land, looking forward to getting it merged!

          Hide
          mkjellman Michael Kjellman added a comment -

          Alright, happy to finally be able to write this. I'm attaching a v1 diff containing Birch!

          Why is it named Birch?

          B+ Tree -> Trees that start with the letter B -> Birch... get it? haha...

          Description

          Birch is a B+ish/inspired tree aimed at improving the performance of the SSTable index in Cassandra (especially with large partitions).

          The existing implementation scales poorly with the size of the index/row, as the entire index must be deserialized onto the heap even to look for a single element. This puts significant pressure on the heap, where one read of a large partition will cause at minimum a long, painful CMS GC pause or – in the worst case – an OOM.

          The Birch implementation has a predictable fixed cost for reads, at the expense of additional on-disk overhead for the tree itself – with the same O(log n) complexity as the existing implementation. Every row added to the SSTable is also added to the primary index. If the size of the row is greater than 64kb we build an index (otherwise we just encode the position of that row in the sstable). All entries encoded into the index are page aligned and padded to the nearest boundary (4096 bytes by default). Every segment can be marked as either internally padded/aligned along a boundary or non-padded/aligned (up to 2GB). Birch indexes are aligned into 4096 byte nodes (both leaf and inner). Keys are encoded inside the node itself, unless they exceed node_size/2 bytes. In that case, the first node_size/2 bytes of the key are encoded into the node itself, along with the offset of the remaining bytes in the overflow page. This enables predictable fixed performance of the tree while accommodating variable-length keys/elements.
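          As a rough illustration of that key encoding (field names and exact layout are assumptions for the sketch, not the actual BirchWriter format):

{code:java}
import java.nio.ByteBuffer;

// Sketch of the key encoding described above (layout and names are illustrative only).
// A key that fits within half a node is written inline; otherwise the first maxInline
// bytes go into the node and the tail is appended to the shared, non page-aligned
// overflow page, with its offset recorded after the inline prefix. A reader would treat
// "inline length == maxInline" as the signal that an overflow offset follows.
static void writeKey(ByteBuffer node, ByteBuffer overflowPage, ByteBuffer key, int nodeSize)
{
    int maxInline = nodeSize / 2;
    int inlineLen = Math.min(key.remaining(), maxInline);
    node.putShort((short) inlineLen);            // length of the inline portion (unsigned short)
    ByteBuffer inline = key.duplicate();
    inline.limit(inline.position() + inlineLen);
    node.put(inline);
    if (key.remaining() > maxInline)
    {
        node.putInt(overflowPage.position());    // where the remainder of the key lives in the overflow page
        ByteBuffer tail = key.duplicate();
        tail.position(tail.position() + inlineLen);
        overflowPage.put(tail);
    }
}
{code}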

          Notes on v1 of the diff (in no particular order)

          • I broke the changes into two logical parts: The first abstracts out the existing Index implementation and adds no new logic. The second includes a IndexedEntry implementation backed by a Birch tree.
          • The attached v1 patch is written for 2.1; I have already started rebasing the patch onto trunk and hope to finish that shortly and post the trunk-based patch
          • There's some high level Javadoc documentation in BirchWriter and PageAlignedWriter on the layout of the tree on disk, serialization and deserialization paths, and higher level goals of the classes
          • The next steps are to start getting feedback from reviews and the community. I also have profiled the tree itself but profiling the tree integrated into the stack and optimizing non-performant code paths is next (after the immediate task to rebase the change onto trunk)
          • There are still a few todo's I've left in regards to handling backwards compatibility, parts of the code I expect might be non-performant, and things I'd like to discuss on the "correct" implementation/behavior etc
          • I have a few unit tests that still don't pass and still need to be root-caused... I've taken the approach this entire time that the unit tests shouldn't be touched to make them pass, so there are still a few behavioral regressions I've accidentally introduced. The current failing tests are:
            • AutoSavingCacheTest
            • SecondaryIndexTest
            • BatchlogManagerTest
            • KeyCacheTest
            • ScrubTest
            • IndexSummaryManagerTest
            • LegacySSTableTest
            • MultiSliceTest
          • I need to write a unit test to test reading the legacy/existing primary index implementation
          • By the nature of the index's role in the database, the unit test coverage is actually pretty extensive as any read and write touches the index in some capacity

          I'll be giving a talk at NGCC tomorrow (Thursday the 9th) to go over the high level design I ended up with and considerations I had to take into account once I actually got deep inside this part of the code.

          Looking forward to feedback!

          Hide
          jkrupan Jack Krupansky added a comment -

          Any idea how a new wide partition will perform relative to the same amount of data and same number of clustering rows divided into bucketed partitions? For example, a single 1 GB wide partition vs. ten 100 MB partitions (same partition key plus a 0-9 bucket number) vs. a hundred 10 MB partitions (0-99 bucket number), for two access patterns: 1) random access a row or short slice, and 2) a full bulk read of the 1 GB of data, one moderate slice at a time.

          Or maybe the question is equivalent to asking what the cost is to access the last row of the 1 GB partition vs. the last row of the tenth or hundredth bucket of the bucketed equivalent.

          No precision required. Just inquiring whether we can get rid of bucketing as a preferred data modeling strategy, at least for the common use cases where the sum of the buckets is roughly 2 GB or less.

          The bucketing approach does have the side effect of distributing the buckets around the cluster, which could be a good thing, or maybe not.

          Hide
          mkjellman Michael Kjellman added a comment -

          I have some insanely encouraging initial performance numbers!... I'd like to do some more validation to make sure I didn't screw up any of the benchmarks before sharing, but the read story is better than I could have ever imagined!

          Hide
          mkjellman Michael Kjellman added a comment -

          Alright, good news! My unit test that creates and reads from an index with 100,000,000 entries (!!) successfully passes!

          I came up with a pretty nice solution to the word-list issue (I was unable to find a word list of 100m+ entries): instead, I am creating n TimeUUID elements – which nicely avoids duplicates, can be generated in unlimited numbers, and come already sorted as they're generated!
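          A sketch of that test-data trick (assuming UUIDGen from the Cassandra utils package, as in the codebase):

{code:java}
import java.util.ArrayList;
import java.util.List;
import java.util.UUID;

import org.apache.cassandra.utils.UUIDGen;

// Time-based UUIDs are unique and come out in (roughly) ascending TimeUUIDType order,
// so n of them make an arbitrarily large, pre-sorted set of test keys with no word list.
static List<UUID> generateSortedTimeUuids(int n)
{
    List<UUID> uuids = new ArrayList<>(n);
    for (int i = 0; i < n; i++)
        uuids.add(UUIDGen.getTimeUUID());
    return uuids;
}
{code}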

          I'm currently profiling the code to come up with numbers...

          Hide
          mkjellman Michael Kjellman added a comment -

          Also, does anyone know of any truly massive word lists that are totally free of legal concerns for testing? (I'm looking for >3-4 million words)

          Hide
          mkjellman Michael Kjellman added a comment -

          I have the new FileSegment friendly implementation working for the following conditions:

          1) straight search for key -> get value
          2) iterate efficiently both forwards and reversed thru all elements in the tree
          3) binary search for a given key and then iterate thru all remaining keys from the found offset
          4) overflow page for handling variable length tree elements that exceed the max size for a given individual page (up to 2GB)

          I have also successfully run some new unit tests I wrote that now do 5000 consecutive iterations with randomly generated data (to "fuzz" the tree for edge conditions), building and validating trees that contain between 300,000-500,000 elements. I've also spent a good amount of time writing some pretty reasonable documentation of the binary format itself.

          Tomorrow, I'm planning on testing a 4.5GB individual tree against the new implementation and doing some profiling to see the exact memory impact now that it's basically completed on both the serialization and deserialization paths. Will update with those findings tomorrow!
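
(A hypothetical interface sketch of the four access patterns above, for illustration only - the names below are invented and not from the actual patch:)

    import java.util.Iterator;
    import java.util.Map;

    public interface PageAlignedIndex<K, V>
    {
        // 1) straight search: key -> value
        V search(K key);

        // 2) iterate efficiently forwards or in reverse through all elements in the tree
        Iterator<Map.Entry<K, V>> iterator(boolean reversed);

        // 3) binary search for a given key, then iterate through all remaining keys from the found offset
        Iterator<Map.Entry<K, V>> iteratorFrom(K key);

        // 4) variable length elements that exceed the max size of a page are expected to be handled
        //    via overflow pages (up to 2GB), transparently to callers of the methods above
    }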

          iamaleksey Aleksey Yeschenko added a comment -

          Jack Krupansky Priority of the JIRA ticket here is irrelevant now that it's being actively worked on - and also CASSANDRA-11206.

          jkrupan Jack Krupansky added a comment -

          Is this issue still considered a Minor priority? Seems like a bigger deal to me. +1 for making it a Major priority - unless there is a longer list of even bigger fish in the queue.

Just today there is a user on the list struggling with time series data and really not wanting to have to split a partition that he needs to be able to scan. Of course, scanning a super-wide partition will still be a very bad idea anyway, but at least narrower scans would still be workable with this improvement in place.

          Is this a 3.x improvement or 4.x or beyond? +1 for 3.x (3.6? 3.8?).

          mkjellman Michael Kjellman added a comment -

          Had a very productive day... made a huge amount of progress on the deserialization side of things and the required serialization changes.

          mkjellman Michael Kjellman added a comment -

Update: the serialization path is basically done with the new design that is more friendly to the SegmentedFile issue. Will finish that up today and move on to deserialization.

          mkjellman Michael Kjellman added a comment -

          well - then you'd lose the cache-line aware logic I've implemented to make the B+ tree efficient...

          jbellis Jonathan Ellis added a comment -

Would it make sense to always use buffered I/O for the B+ tree instead of mmap?

          mkjellman Michael Kjellman added a comment -

I was asked to update. Making good progress. A large number of tests pass. I'm basically just getting the math right for "Inception: The Cassandra Director's Cut" due to the need to make a B+ tree (disk based by its very definition) work with SegmentedFile etc...

          mkjellman Michael Kjellman added a comment -

Jack Krupansky ~2GB is the max target I'd recommend at the moment, from experience.

The current implementation will create an IndexInfo entry for every 64 KB (by default - but I highly doubt anyone actually changes this default) worth of data. Each IndexInfo object contains the offset into the sstable where the partition/row starts, the length to read, and the name. These IndexInfo objects are placed into a list and binary searched over to find the name closest to the query. Then, we go to that offset in the sstable and start reading the actual data.

The issue that makes things so bad with large partitions is that when doing an indexed read across a given partition, the entire list of IndexInfo objects is currently just serialized one after another into the index file on disk. To use it we have to read the entire thing off disk, deserialize every IndexInfo object, place them into a list, and then binary search across it. This creates a ton of small objects very quickly that are likely to be promoted and thus create a lot of GC pressure.

If you take the average size of each column you have in a row, you can figure out how many index entry objects will be created (one for every 64 KB of your data in that partition). I've found that once the IndexInfo array contains > 300K objects, things get bad.

The implementation I'm almost done with has the same big O complexity (O(log n)) as the current implementation, but instead the index is backed by page-cache-aligned mmap'ed segments (B+ tree-ish, with an overflow page implementation similar to that of SQLite). This means we can now walk the IndexEntry objects and only bring the 4 KB chunks involved in the binary search for the correct entry onto the heap.

          The tree itself is finished and heavily tested. I've also already abstracted out the index implementation in Cassandra so that the current implementation and the new one I'll be proposing and contributing here can be dropped in easily without special casing the code all over the place to check the SSTable descriptor for what index implementation was used. All the unit tests and d-tests pass after my abstraction work. The final thing I'm almost done with is refactoring my Page Cache Aligned/Aware File Writer to be SegmentedFile aware (and make sure all the math works when the offset into the actual file will differ depending on the segment etc).
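
(Back-of-the-envelope arithmetic for the 64 KB interval described above; the numbers are illustrative, not measurements from the patch:)

    public class IndexEntryMath
    {
        public static void main(String[] args)
        {
            long partitionBytes = 6_400_000_000L;  // ~6.4 GB partition, as in the original report
            long intervalBytes = 64 * 1024;        // column_index_size_in_kb default
            long entries = partitionBytes / intervalBytes;
            System.out.println(entries + " IndexInfo entries"); // ~98K entries, roughly 2x that many ByteBuffers

            // With 4 KB page-aligned tree nodes, a lookup only touches the handful of pages on the
            // search path instead of materializing all ~98K entries on the heap before searching.
        }
    }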

          jkrupan Jack Krupansky added a comment -

          large CQL partitions (4GB,75GB,etc)

What is the intended target/sweet spot for large partitions... 1GB, 2GB, 4GB, 8GB, 10GB, 15GB, 16GB, or... what? Will random access to larger partitions create any significant heap/off-heap memory demand, or will heap/memory usage simply be proportional to the total rows accessed, regardless of how they might be bucketed into partitions?

          Will we be able to tell people that bucketing of partitions is now never needed, or will there now just be a larger bucket size, like 4GB/partition rather than the 10MB or 50MB or 100MB that some of us recommend today?

          jbellis Jonathan Ellis added a comment -

          Michael Kjellman I'd be very interested to see what impact this has on query performance as partition size grows.

          mkjellman Michael Kjellman added a comment -

          I spent the last 15 hours finishing up the last remaining pieces on the serialization... almost there..

          jjirsa Jeff Jirsa added a comment -

Michael Kjellman - I know you've put a lot of thought into this already. This is impacting us and I'd love to help. Is there anything I can do to assist? Are you working on a patch I can help you test (or can I volunteer to help write tests or similar)?

          aweisberg Ariel Weisberg added a comment -

Maybe it makes sense to have an intermediate step. Leverage Robert's work in 9738 to transition to a simple off-heap representation that can be mapped, and then remove the key cache. It seems like this would effectively increase the upper bound on usable partition size in all cases compared to what we have today (is this a true statement?).

          After, or in parallel work on another representation for partition indexes.

          snazy Robert Stupp added a comment -

          I think we definitely need better data structures since RIE is neither a good fit for KC nor index. That's the point in this ticket, CASSANDRA-8931, CASSANDRA-9843 and the pitfall in current WIP in CASSANDRA-9738.
I don't fully agree on relying on the page cache due to its granularity (4 KB, I think), which might be too coarse for keys. But that depends on the actual data structure - i.e. grouping "hot" keys per page, which conflicts with immutable sstables.
          Another point is the effort to move to thread-per-core model, having distinct and independent data structures per thread without barriers/locks/whatever - and page-cache is a shared resource.
          Next thing is hot and cold data - i.e. we could use bigger intervals (column_index_size_in_kb in current terminology) for cold data.
          TBC: I'm not against page cache or so - just want to note what I think may influence new stuff.

          mkjellman Michael Kjellman added a comment -

          Just a couple more thoughts:

1) One big question I have is how an on-disk B+ tree would play with the on-heap (or off-heap) cache. Ideally, I believe we would only want to cache on the heap columns that we got a read request for, to avoid polluting the heap with objects that were never actually requested by the user. However, as we store intervals in the index and not all the actual values, I'm not sure how to do efficient in-memory lookups if we only stored fragments of the overall index for a given key. For instance, if you put one matching leaf of index objects from the B+ tree into the cache and then got another request for the same key but a different index interval, we'd need to constantly keep rebalancing some kind of data structure on the heap, and I'm operating under the assumption that would be pretty inefficient and painful. Which brings me to...
          2) Maybe the KeyCache isn't necessary if we implement an efficient B+ On-Disk format. I'm doubtful that anything we implement from a cache perspective inside the application will be better than the kernel's page cache. Frequently accessed keys would be the pages most likely to be in the page cache as well, so we still should get the benefit of LRU eviction.

          jbellis Jonathan Ellis added a comment -

          1. Learning time for us would be compaction
          2. ISTM this was not core to the algorithm, but it's been a while since I read the details
          3. We could store the offset in the ARF leaves, this was definitely not core
          4, 5. Yes, this is a key point. Like our existing index, ARF is designed to be memory-resident. As partitions grow larger the ARF would degrade accuracy rather than spilling to disk (like a B-tree) or getting obscenely large (like our existing index).

          I would add,

          6. Because of (5), ARF gives you BF-like behavior for range queries and can quickly optimize away scans of sstables that don't contain the data in question. (A very good fit for DTCS; a smaller benefit for LCS.)

          So, maybe we really want both. ARF for the quick reject, (on-disk) B+ for "where do I start scanning."

          mkjellman Michael Kjellman added a comment -

          Jonathan Ellis I read a bunch of the Adaptive Range Filters (ARF) paper from CASSANDRA-9843 and I'm not sure it's a good fit here, but please let me know if I misunderstood the paper.

          1) The key optimization that they make in ARF is the "adaptive" or learning component that resizes and optimizes to adjust the shape and layout of the B-Tree dynamically as elements are added or removed. This was important to the original authors as Microsoft SQL Server appears to have "hot" and "cold" storage and so every time a record exceeds the threshold to go from hot -> cold storage they need to efficiently modify and update the tree. Given that all of our data structures in this case will be immutable this has no benefit to us.
          2) The paper uses fixed length keys for the intervals. I'm unsure how we could modify this to accommodate variable length start and end values as are required for index info
          3) ARF (like Bloom Filters) only returns true or false to indicate the presence of a value in that interval. For IndexInfo we need the offset and width into the data to start reading to retrieve the actual data. I'm not sure how we can modify ARF to accommodate returning values and not just a boolean.
          4) If we dynamically adjust the size of the intervals and make them larger it means that we will make the index even worse, requiring us to read a larger amount of actual data as our offset and width will be increased. If the query was a false positive it would mean they would become even more expensive given the possibility of reading a huge amount of data for the larger interval to actually find no data.
5) It appears ARFs were always intended to stay in memory. I'm not sure how feasible it would be to implement ARF in an on-disk, cache-friendly way. As we need to store the bytes for the first and last values for each indexed position, I think we still should use a data structure that can be used from disk without deserializing the entire structure onto the heap (which is the same problem we have with the current storage format today).

          Let me know if you had any more thoughts or if I've misunderstood the paper!
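
(To make point 3 concrete, a hypothetical contrast of the two shapes - names invented here purely for illustration:)

    import java.nio.ByteBuffer;

    // ARF-style structure: answers only whether a range may contain data
    interface RangeFilter
    {
        boolean mayContain(ByteBuffer start, ByteBuffer end);
    }

    // What the clustering index actually needs to return: where to start reading, and how much
    interface ClusteringIndexLookup
    {
        long offsetFor(ByteBuffer name);
        long widthFor(ByteBuffer name);
    }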

          kohlisankalp sankalp kohli added a comment -

How is this any different from an interval tree where we use a boolean instead of a max for leaf nodes?

          Disclaimer: I did not read the whole paper

          mkjellman Michael Kjellman added a comment -

          Jonathan Ellis I PoC'ed a B+ Tree but certainly not tied to it in any way. Let me read this white paper today and see if I can get a PoC together.

          jbellis Jonathan Ellis added a comment -

          ARF may be a better fit than B+tree. CASSANDRA-9843

          kohlisankalp sankalp kohli added a comment -

If we use a B+ tree, we can actually put a leaf node in the key cache. For most small partitions, leaf = root due to their small size.

          snazy Robert Stupp added a comment -

          Linking CASSANDRA-9738 as it runs into the same problem with IndexInfo objects (just off-heap instead of disk).

          mkjellman Michael Kjellman added a comment -

We've had a bunch of discussions around this over the past few weeks and I think I finally have a grasp of the entire issue. The issue here is that large CQL partitions (4GB,75GB,etc) end up with large 200MB+ serialized indexes. The current logic, when we don't get a cache hit, is to deserialize the entire thing and split it into IndexInfo objects which contain 2 ByteBuffers (first and last key) and 2 longs (offset and width). This means we get a very large number of small, most likely short-lived objects creating garbage on the heap - and with a high probability they will be evicted from the cache anyway. On disk we just lay the objects down with the assumption that the entire thing will always be deserialized when it's needed and never accessed from disk without deserializing the entire thing.

I think the only option here is to make a change to the actual way we lay things out on disk. Two options would be a skip list or a B+ tree, where we mmap the pages of the index and try to do something intelligent to avoid bringing objects onto the heap as much as possible. The downside of a B+ tree would be the overhead of creating it on flush, and lookups are O(log n) (although the current code is O(log n) too, since we binary search over the objects we deserialized into the List - it just does so on the heap).

          The only references I could find to B+ Trees in this project were CASSANDRA-6709 and CASSANDRA-7447. I think we don't need to reinvent the wheel here and entirely change the storage format but I think if we just use a targeted data structure just for the Index we might get something nice. The question would be what impact will this have for "normal" rows/partitions.

          Any input on other on disk data structures we might want to consider would be great.

          The other issue is that I'd love to be able to only cache the column that we got a hit on for the cache. Unfortunately that might be difficult. Today we binary search over the entire List<IndexInfo> to find hits. If you get a column that's in between the first and last name you return the left node and go and check and hopefully it's actually there. As we essentially have interval-ish objects here along with non fixed length values it does make things a bit more fun.
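
(A simplified sketch - not the actual Cassandra code - of the "deserialize the entire thing" pattern described above; every entry becomes two ByteBuffers plus two longs on the heap before the binary search can even begin:)

    import java.io.DataInput;
    import java.io.IOException;
    import java.nio.ByteBuffer;
    import java.util.ArrayList;
    import java.util.List;

    final class IndexInfoSketch
    {
        final ByteBuffer firstName;
        final ByteBuffer lastName;
        final long offset;
        final long width;

        IndexInfoSketch(ByteBuffer firstName, ByteBuffer lastName, long offset, long width)
        {
            this.firstName = firstName;
            this.lastName = lastName;
            this.offset = offset;
            this.width = width;
        }

        // reads the whole serialized index back as one flat list of heap objects
        static List<IndexInfoSketch> deserializeAll(DataInput in, int entries) throws IOException
        {
            List<IndexInfoSketch> infos = new ArrayList<>(entries);
            for (int i = 0; i < entries; i++)
            {
                ByteBuffer first = readWithShortLength(in);
                ByteBuffer last = readWithShortLength(in);
                infos.add(new IndexInfoSketch(first, last, in.readLong(), in.readLong()));
            }
            return infos; // hundreds of thousands of heap objects for a multi-GB partition
        }

        private static ByteBuffer readWithShortLength(DataInput in) throws IOException
        {
            byte[] bytes = new byte[in.readUnsignedShort()];
            in.readFully(bytes);
            return ByteBuffer.wrap(bytes);
        }
    }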

          snazy Robert Stupp added a comment -

Yes, the purpose of CASSANDRA-9738 is to move the (permanent part of the) key cache off-heap - basically eliminating a huge number of tiny objects in the old gen.

          we can still optimize on reducing the objects created

          Absolutely agree! Although, the first benchmark I did with CASSANDRA-9738 shows a huge reduction of GC effort of roughly 90% (total and avg GC time with 8u45+G1) for a read-only workload with actually more (short-lived) objects created - I suppose these will never be promoted to the old gen.

          kohlisankalp sankalp kohli added a comment -

          If I am understanding it correctly, CASSANDRA-9738 will move the key cache off heap. This will help as these IndexInfo objects won't be promoted as they won't be referenced by any on heap map.

However, we will still need to generate all these objects for each read whether it is in the key cache or not. With CASSANDRA-9738, on a key cache hit, we will create new objects again when deserializing.

I think CASSANDRA-9738 is good but we can still optimize on reducing the objects created, which is currently proportional to the size of the CQL partition.

          jbellis Jonathan Ellis added a comment -

          CASSANDRA-9738 might be the simplest way forward then.

          kohlisankalp sankalp kohli added a comment -

The strongly referenced IndexInfo objects are mostly from the key cache (we are using 100MB). The problem is that due to the key cache, all these objects will get promoted and cause fragmentation when evicted from the key cache.

          I also looked at a heap dump generated by an OOM. 93% of the retained size was coming from IndexInfo.

          jbellis Jonathan Ellis added a comment - - edited

          But are most of those in the index summary (should be relatively stable once tenured) or the rowcache [Edit: meant key cache] (high churn)?


            People

            • Assignee:
              mkjellman Michael Kjellman
              Reporter:
              kohlisankalp sankalp kohli
              Reviewer:
              Branimir Lambov
• Votes:
  9
  Watchers:
  43
