Details

    • Type: Sub-task
    • Status: Resolved
    • Priority: Major
    • Resolution: Won't Fix
    • Fix Version/s: None
    • Component/s: None
    • Labels:
      None

      Description

The key cache still uses an on-heap concurrent map. It could go off-heap, and that feels doable now, after CASSANDRA-8099.

Evaluation should be done in advance, based on a POC, to prove that a pure off-heap key cache buys a performance and/or GC-pressure improvement.

In theory, eliminating on-heap management of the map should buy us some benefit.
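A minimal sketch of the idea (illustrative names only, not Cassandra's or OHC's actual classes): keeping the serialized values in direct ByteBuffers moves them out of the GC's reach. A true off-heap cache such as OHC also moves the map structure itself off-heap, which this sketch does not.

```java
import java.nio.ByteBuffer;
import java.util.concurrent.ConcurrentHashMap;

// Sketch: values live off-heap in direct buffers, so the GC never scans
// or copies them; only the small map entries remain on-heap. All names
// here are illustrative, not Cassandra's real classes.
public class OffHeapValueStore {
    private final ConcurrentHashMap<Long, ByteBuffer> entries = new ConcurrentHashMap<>();

    public void put(long key, byte[] serialized) {
        ByteBuffer off = ByteBuffer.allocateDirect(serialized.length);
        off.put(serialized).flip();
        entries.put(key, off);
    }

    // A get still has to copy the bytes back on-heap before use --
    // exactly the copy cost debated later in this thread.
    public byte[] get(long key) {
        ByteBuffer off = entries.get(key);
        if (off == null) return null;
        byte[] onHeap = new byte[off.remaining()];
        off.duplicate().get(onHeap);
        return onHeap;
    }
}
```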

        Issue Links

          Activity

          snazy Robert Stupp added a comment -

          Closed as won't-fix.

          We already have CASSANDRA-11206 to handle large partitions, and CASSANDRA-9754 is also making progress. Both tickets make this one somewhat superfluous.

          snazy Robert Stupp added a comment -

          With CASSANDRA-11206 we eliminate a big pain (IndexInfo GC pressure and serialization).

          We could move the key-cache off-heap after 11206. But it's also possible that other optimizations (CASSANDRA-8931, CASSANDRA-9754 and CASSANDRA-9786) may result in a completely different structure. That said, I tend to prefer tackling at least 8931 and 9786 before this one and seeing what their outcome is.

          snazy Robert Stupp added a comment -

          Yes, it is.
          Just want to get the current UDF/UDA stuff done before.
          Generally, I'd like to see what's coming with CASSANDRA-9754. The current "problem" is the amount of off-heap serialization for big partitions, as we do a binary search for the lookup. If that gets better with CASSANDRA-9754, that would be great.
          We can optimize the whole path with CASSANDRA-8931 and CASSANDRA-9786. If these give promising results, a follow-up might be to merge the summary into index-info.

          jbellis Jonathan Ellis added a comment -

          Is this still on your plate, Robert?

          snazy Robert Stupp added a comment -

          One thing to consider for RC1, IMO, is to change the index file format. Having that in RC1 gives us the flexibility to add 9738 in 3.0.0 or 3.0.x - so I've created CASSANDRA-10314.

          snazy Robert Stupp added a comment -

          consider pushing this back

          Also +1.
          The major open tasks are to eliminate the need to copy the RIE on cache-get (which requires bullet-proof ref-counting) and the double buffering during RIE building (which is sometimes just a waste of CPU/heap, as the returned RIE object is never used).

          slebresne Sylvain Lebresne added a comment -

          If the code is still being majorly revised so close to RC, we should consider pushing this back

          I agree.

          benedict Benedict added a comment -

          In hindsight wholesale replacing the existing key cache was not a good incremental way to effect this change in a low risk way

          So, looking at the latest performance comparisons, it doesn't look like the benefit is so uniformly profound as to be worth rushing this out. It looks like whatever was particularly affecting 3.0 in the early benchmarks (that notably wasn't affecting prior versions) has been ironed out.

          If the code is still being majorly revised so close to RC, we should consider pushing this back, and if incorporated before GA at least consider making it optional. It's turning into a patch of really significant size, and we already have a great deal of risk associated with this release.

          aweisberg Ariel Weisberg added a comment -

          OK, well I don't have an answer for 3.0 on how to completely eliminate the risk that the copying overhead will be an issue.

          If it keeps changing there is no way I will get to the point where I can +1 the performance dimension. I might be able to get to +1 on correctness if things stop changing long enough for me to check it out and walk through everything. Performance could very well be good enough/better for realistically sized indexed entries. If we benchmark the final code and it's better, then I guess I could be +1 on performance and ignore the overhead of allocating and copying for each cache access.

          In hindsight, wholesale replacing the existing key cache was not a good incremental way to effect this change in a low-risk way. Introducing a new key cache and iterating on it would have allowed for faster development, more frequent merging to trunk, smaller reviews, etc. We would also always have something to ship once release time rolls around, even if some change was pending.

          On IndexEntryStats: I don't get why you are casting down to int; why not a checked cast on the way out? I don't understand the overflow protection here. It's a long; is it really going to overflow? An average that doesn't take the distribution into account is suspect to me. Should the allocation be slightly larger than the average to prevent reallocation? When allocating the offset buffer, is it using the count of index infos without multiplying by their size? We are writing sampled records, so I am skeptical this is worth optimizing when we have to get this done and correct. There are better options in subsequent versions, like appending to a file. Did this change have an impact and improve some metric?
          benedict Benedict added a comment -

          In a lot of cases the RIE is going to be enclosed by a Closable iterator anyways.

          Given our historically poor handling of resource release, and how many nooks and crannies they end up in now, I'd prefer we didn't depend on the closure of iterators until we have some better facilities in place for catching leaks, like CASSANDRA-9918. Right now I don't think we're likely to leak much on a per-request/leak basis (just prevent some shared cleanup, per discarded major resource, such as an sstable) if we fail to close an iterator. So this could magnify the effect of any such bug.

          snazy Robert Stupp added a comment -

          Pushed another update with the following changes:

          • Legacy IndexEntry objects / missing offsets: re-serializing seemed too expensive to me, but not having the offsets in the key-cache is also bad. For 3.0+ IndexEntry structures we have the offsets, so that's not a problem. For legacy IndexEntry structures, it generates all offsets. The change is that the key-cache will include the calculated offsets even for legacy tables. This eliminates the need to re-serialize the whole thing. (legacyIndexInfoSearch() is gone.)
          • Moved the offsets-"array" back to the end of the serialized IndexEntry structure.
          • Sizing the serialization buffers: I tried to find easily and cheaply accessible stats for partition size and (more importantly) serialized IndexEntry size for a CF, but didn't find anything. So I added poor-man's stats to CFMetaData that track the numbers needed to size the serialization buffers appropriately. These stats are not persisted, just calculated at runtime - so it's not a "perfect" thing. The problem with the existing stats is that we know the mean partition size but not the mean size of the ClusteringPrefixes, which could be used to initialize the poor-man's stats during startup. (Note: the current TableMetrics.meanPartitionSize iterates over all sstables to retrieve the value.)
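The buffer-sizing stats described above might look something like this minimal sketch (hypothetical names, with an arbitrary 25% padding factor; not the actual CFMetaData code):

```java
// Sketch of "poor man's" runtime stats: track a running mean of serialized
// sizes and use a padded mean as the initial buffer size, so
// DataOutputBuffer-style doubling is rarely needed. Names are illustrative.
public class SerializedSizeStats {
    private long count;
    private long totalBytes;

    public synchronized void record(int serializedSize) {
        count++;
        totalBytes += serializedSize;
    }

    // Pad the mean a bit so a typical entry fits without reallocation;
    // the 25% padding factor is an arbitrary choice for this sketch.
    public synchronized int bufferSizeHint(int floor) {
        if (count == 0) return floor;
        long mean = totalBytes / count;
        return (int) Math.max(floor, mean + mean / 4);
    }
}
```

As the comment notes, these numbers are process-local and reset on restart, matching the "not persisted" caveat above.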
          mkjellman Michael Kjellman added a comment -

          Ariel Weisberg I'm having trouble keeping up with the number of diffs created from editing your comments on this thread. Any chance you could create new comments instead of editing the previous ones? Thanks

          aweisberg Ariel Weisberg added a comment -

          Is that the entire partition's worth of index then that we're copying on heap?

          Yes

          Can we use the partition size statistics to automatically pick the appropriate implementation?

          Possibly. If we have two cache implementations at once how do you configure how big each one is? It seems tricky to do in one week.

          I think we would have better luck trying to get reference counting working so there doesn't need to be a copy. In a lot of cases the RIE is going to be enclosed by a Closable iterator anyways.
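The ref-counting idea could look roughly like this (an illustrative sketch, not Cassandra's actual Ref/RefCounted machinery, which CASSANDRA-9918 extends with leak detection): each cache hit takes a reference on the off-heap bytes instead of copying them, and the memory is only reclaimable once every reader has released.

```java
import java.nio.ByteBuffer;
import java.util.concurrent.atomic.AtomicInteger;

// Sketch of the "bullet-proof ref-counting" that would let cache hits hand
// out the off-heap bytes without copying. Illustrative names only.
public class RefCountedBuffer {
    private final ByteBuffer buffer;
    private final AtomicInteger refs = new AtomicInteger(1); // creator holds one ref

    public RefCountedBuffer(int capacity) {
        this.buffer = ByteBuffer.allocateDirect(capacity);
    }

    // Each reader takes a ref before using the buffer...
    public ByteBuffer ref() {
        int prev;
        do {
            prev = refs.get();
            if (prev == 0)
                throw new IllegalStateException("already released");
        } while (!refs.compareAndSet(prev, prev + 1));
        return buffer.duplicate();
    }

    // ...and must release it when done; the off-heap memory is only
    // reclaimable once the count drops to zero.
    public void release() {
        int now = refs.decrementAndGet();
        if (now < 0) throw new IllegalStateException("double release");
        // when now == 0, the off-heap memory could be freed/reused here
    }

    public int refCount() { return refs.get(); }
}
```

The hard part, as the thread notes, is guaranteeing release on every code path, including iterators that are never closed.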

          jbellis Jonathan Ellis added a comment -

          For cache hits we have to copy the entire IndexedEntry onto the heap into unpooled memory. That makes an operation that was lg N linear in the size of the IndexedEntry.

          Is that the entire partition's worth of index then that we're copying on heap?

          Can we use the partition size statistics to automatically pick the appropriate implementation?
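To illustrate the lg N vs. linear point, here is a sketch of a binary search performed directly on a serialized, off-heap layout without materializing anything, as opposed to copying the whole entry on-heap first. The layout (a fixed-width offsets table followed by entries keyed by a long) is invented for illustration and is not Cassandra's actual format.

```java
import java.nio.ByteBuffer;

// Invented layout: [int count][int offset] * count, then the entries,
// each beginning with a long key. A lookup touches only lg N positions
// of the (possibly off-heap) buffer; copying the whole entry on-heap
// first would instead cost time linear in its serialized size.
public class SerializedIndexSearch {
    public static int indexOfKey(ByteBuffer buf, long key) {
        int count = buf.getInt(0);
        int lo = 0, hi = count - 1;
        while (lo <= hi) {
            int mid = (lo + hi) >>> 1;
            int entryOffset = buf.getInt(4 + 4 * mid);
            long midKey = buf.getLong(entryOffset); // read just the key, no copy
            if (midKey < key) lo = mid + 1;
            else if (midKey > key) hi = mid - 1;
            else return mid;
        }
        return -1;
    }

    // Helper to build a buffer in the layout above (keys must be sorted).
    public static ByteBuffer build(long[] sortedKeys) {
        int header = 4 + 4 * sortedKeys.length;
        ByteBuffer buf = ByteBuffer.allocateDirect(header + 8 * sortedKeys.length);
        buf.putInt(0, sortedKeys.length);
        for (int i = 0; i < sortedKeys.length; i++) {
            buf.putInt(4 + 4 * i, header + 8 * i);
            buf.putLong(header + 8 * i, sortedKeys[i]);
        }
        return buf;
    }
}
```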

          aweisberg Ariel Weisberg added a comment - edited

          The additions to VIntCoding are now unused

          Moving the offsets buffer to the beginning is a mistake, I think, unless you plan to reallocate all the time, and I don't think you should. When we are building these, they are for new sstables that are likely to be compacted away, so the time spent compacting the RIE is not productive. Also, the sizing is off: the offsets are much smaller, so writing to the offsets DataOutputBuffer will require more doubling/reallocation, and I am not sure it will reallocate to the correct size efficiently. If you are going through all that and you know the serialized size of everything anyway, it seems like a waste; just allocate a correctly sized BB manually and put everything in it.

          When deserializing an old-format IndexedEntry, I think you have to rewrite it to generate the offsets; otherwise the cache will never be populated with an entry where the offsets are calculated, which will make it slower than the older version. I also think this will be faster than generating them incrementally, since it's a nice tight loop doing a scan of memory. If you are doing a binary search you will end up doing most of that work anyway, and if it's a scan you will also end up doing most of it.

          RowIndexEntry.java line 423, legacyIndexInfoSearch: it's still doing a loop from the beginning of the offsets to the last offset it calculated. There is no need; we should know the last calculated offset available and skip to it. For a scan, this operation becomes n^2 with that loop. I think it should go away completely: just rewrite the IndexedEntry during deserialization, since you are making a copy anyway when you bring it out of the field.

          Jonathan told me the expectation is that people run upgradesstables, so we don't need to be heroic. Let's go for the simplest possible solution, which is making the old and new formats match after deserialization. Hopefully this means we can remove a bunch of paths based on which format we are looking at.

          For cache hits we have to copy the entire IndexedEntry onto the heap into unpooled memory. That makes an operation that was lg N linear in the size of the IndexedEntry. In terms of raw speed, the on-heap cache is going to be better off using the new serialization, but it will really poke the garbage collector in the eye. At least with the OHC cache the garbage is short-lived.

          I don't like to give people options they have to choose from, but I am more afraid of making the product unworkable for some use case. Maybe we should allow the key cache to be selectable for 3.0? I feel like these are the two options that get us to 3.0 while minimizing regret post release.

          We have one week so it's less is more time.

          slebresne Sylvain Lebresne added a comment -

          To put your mind at ease, I'm not particularly attached to any of the changes from CASSANDRA-10232. The goal of that ticket was to make a very simple change that, given how we used the RIE at the time, provided hopefully clear gains (not only for performance, btw; getting smaller files on disk is a nice thing by itself). But if the uses of the RIE change and other trade-offs make more sense, so be it. Of course, the more time we have to validate those choices with benchmarks, the better.

          I am questioning this on the grounds that it requires walking the entire partition index and doing work even to random access?

          If you mean that the cost of iterating over the index to recompute the offsets could outweigh the gain of reading less data, then you could be right, but my hunch is that this probably doesn't make much measurable difference in practice. So the goal was more to save some space on disk than anything else (it's not hugely important, but it's a nice-to-have). But anyway, I'll remark that even if we want to avoid that walk, keeping both the width and the offset of each entry is still redundant. The other alternative is to keep the offset but ditch the width, as the width can be recomputed from the current index offset and the next one (so without a full walk). The reason I didn't go with that alternative originally is that I suspect the code to read old entries (the backward-compatibility code) would be more awkward. Anyway, if we keep vint encoding, I guess keeping both the width and the offset is not such a big deal, and I'm fine with that if it's simpler, but it's definitely possible to keep only one.
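The width/offset duality described here, as a tiny sketch (hypothetical helper names): offsets are a prefix sum of widths, while a single width is recoverable from two adjacent offsets without any walk.

```java
// Offset and width are redundant: either can be derived from the other.
// Deriving offsets from widths needs a full walk (prefix sum); deriving
// one width from offsets needs only the entry and its successor.
public class OffsetWidth {
    // widths -> offsets: this is the "walk" being discussed
    public static long[] offsetsFromWidths(long[] widths) {
        long[] offsets = new long[widths.length];
        long off = 0;
        for (int i = 0; i < widths.length; i++) {
            offsets[i] = off;
            off += widths[i];
        }
        return offsets;
    }

    // offsets -> one width: random access, no walk; the last entry's
    // width comes from the total serialized size.
    public static long widthAt(long[] offsets, long totalSize, int i) {
        long next = (i + 1 < offsets.length) ? offsets[i + 1] : totalSize;
        return next - offsets[i];
    }
}
```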

          aweisberg Ariel Weisberg added a comment -

          Thanks to Benedict for providing some clarity on this.

          So there are two general access paths to an indexed RIE: a scan, and a binary search to support random access. For the scan it is fine to materialize an entire IndexInfo. For the binary search case we don't want to materialize an IndexInfo object, as this would hurt performance compared to the current POJO implementation.

          The current code has cases where it accesses fields from the IndexInfo by index. I would like to get away from that and just return the POJO and access fields from the POJO. As far as we know there is no degenerate case where it's pulling all the fields from different indexes interleaved.

          We are getting a big win from this compared to the POJO implementation simply by reducing the cost of loading/unloading an IndexedEntry to a memory copy, as well as reducing the cost of building an IndexedEntry by serializing it up front instead of building a list of POJOs and then coming back and serializing it.

          We should be able to preserve the use of vints. We should optimize the layout of an IndexInfo by having the clustering prefix field as the first field so that binary search doesn't have to do extra decoding. During a scan the cost of materializing and extra decoding (which we can avoid later if we want) is small compared to total operation cost for each entry materialized.

          Another optimization in addition to vints (and Benedict we didn't talk about this in the hangout) was dropping the offset field from IndexInfo.
          This was then recalculated in AbstractSSTableIterator.IndexState. I don't see a conflict between 9738 and this choice, but now I am questioning it on the grounds that it requires walking the entire partition index, doing work even for random access. I didn't pick up on that in my review of 10232. We have also agreed that because the IndexedEntries are sampled (not per row or per partition) they are not as size-constrained, so keeping the field seems like the right choice.

          The last optimization from 10232 I want to consider is using the 64k WIDTH_BASE to reduce the size of the offset field. I don't see why we can't preserve that.

          We also want to keep the reduction in serializer allocations. I checked, and it looks like that has been preserved.

          aweisberg Ariel Weisberg added a comment -

          Sylvain Lebresne
          For the OHC key cache we are looking at rolling back the core changes for #10232 because they get in the way of accessing the data off heap when we want to random access fields. Also the RIE is sampling so the value of slightly shrinking the entries seems low to me. There is a time/space tradeoff to make since we have to binary search the indexed RIE.

          Some of the other optimizations, like dropping fields and recalculating them, don't work because we don't want to process the index in order to read it; we just copy the bytes. I think this also prevents the optimization where you rebased the offset at 64k, because that also requires a pass through the data to parse.

          I asked snazy to benchmark with large partitions with and without vints, but I think it's splitting hairs either way.

          It's a bad situation in general because when we cache/uncache these things we copy the entire thing which is yucky. It will be faster this way because at least it is a straight memory copy without an object graph or parsing.

          My pony for 3.0 would be to 64-bit map the entire index file, then get 32-bit ByteBuffer slices out of it and not use the key cache at all, but there is a time constraint.

          I think just mapping it naively might get us 80 in the 80/20 of CASSANDRA-9754.

          Benedict also mentioned CASSANDRA-8931 which sounds like good next release fodder for making the index smaller.
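The "64-bit map, 32-bit slices" idea could be sketched like this (illustrative, not Cassandra code): a single MappedByteBuffer tops out at Integer.MAX_VALUE bytes, so a large index file has to be mapped in chunks and reads routed to the right slice.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Sketch: map a file of arbitrary (64-bit) size as a series of 32-bit
// MappedByteBuffer chunks and route each read to the right slice.
public class ChunkedIndexMap {
    private final MappedByteBuffer[] chunks;
    private final long chunkSize;

    public ChunkedIndexMap(Path file, long chunkSize) throws IOException {
        this.chunkSize = chunkSize;
        try (FileChannel ch = FileChannel.open(file, StandardOpenOption.READ)) {
            long size = ch.size();
            int n = (int) ((size + chunkSize - 1) / chunkSize);
            chunks = new MappedByteBuffer[n];
            for (int i = 0; i < n; i++) {
                long pos = i * chunkSize;
                chunks[i] = ch.map(FileChannel.MapMode.READ_ONLY, pos,
                                   Math.min(chunkSize, size - pos));
            }
        }
    }

    // Read one byte at a 64-bit position (entries crossing a chunk
    // boundary would need extra handling -- omitted in this sketch).
    public byte get(long position) {
        return chunks[(int) (position / chunkSize)].get((int) (position % chunkSize));
    }

    // Self-contained demo: writes a tiny temp file and reads a byte back.
    public static int demo() {
        try {
            Path p = Files.createTempFile("idx", ".bin");
            Files.write(p, new byte[]{0, 1, 2, 3, 4, 5});
            ChunkedIndexMap m = new ChunkedIndexMap(p, 4);
            return m.get(5);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```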

          snazy Robert Stupp added a comment -

          I've got a couple of cstar perf tests done. Here's a short summary:

          • OHC-KC is at least as fast as vanilla 3.0, if not faster.
          • 3.0 generally beats 2.2 - especially with "big" partitions. Most of the remaining cases are mitigated with OHC-KC.
          • 2.2 shows better numbers in a few cases - especially reading small rows. But that's not an OHC-KC issue - it's also true for vanilla 3.0.
          • OHC-KC generally shows equal if not better GC pressure.

          Now going to write a "wall of text" describing OHC-KC changes/architecture and a more sophisticated performance test interpretation.

          snazy Robert Stupp added a comment -

          Some of the missing test-coverage is blocked by CASSANDRA-10237.
          Will work on the rest of the issues.

          Also scheduled a bunch of perf tests on cstar with a few but big partitions (100k and 500k rows, gaussian distribution).
          Will post a summary of these tests later.

          aweisberg Ariel Weisberg added a comment -

          Just going off of coverage from "ant test" not including your work today. I have had some bad luck recently with emma claiming stuff is not covered. Hopefully it's accurate in these instances.

          • Serializers.LegacyClusteringPrefixSerializer.deserialize isn't tested. Is that blocked by CASSANDRA-10237?
          • IndexInfo.skip isn't labeled as covered and neither is serializedSize
          • RowIndexEntry.equals is not covered, neither is hashCode, same for RIE.IndexedEntry
          • RowIndexEntry.Serializer.serialize doesn't test manually serializing the old version
          • RowIndexEntry.IndexedEntry constructor doesn't test loading the old version
          • RowIndexEntry.IndexedEntry.indexInfo doesn't test the old version, also the branch at the end where it stores the offset it happens to know about after finishing
          • RowIndexEntry.IndexedEntry.indexInfo does it have to do that loop for already discovered offsets? Can you store that value and not loop?
          • RIE.IndexedEntry.indexCount() could store the offset as a constant, also firstIndexInfoOffset
          • RIE.IndexedEntry.promotedSize for the old version not covered
          • Slice.Serializer.skip() is not tested/covered, neither is serialized size or part of deserialize, but you didn't change the latter two
          • RowIndexEntryTest has commented out code

          Reviewing test contents now. Also going to update and look at your latest.

          benedict Benedict added a comment -

          It can actually be faster to compare the exact class instance, as this is pretty well optimised by hotspot (and should just involve comparing the object headers). It is unlikely to make much of a difference here, especially with the shallow class hierarchy here, just noting it.

          In this case it looks like for correctness in the RowIndexEntry class it needs to be an exact match anyway though.

          The IndexedEntry doesn't need to duplicate those checks though, since they're performed by the call to RIE.equals
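Benedict's two points can be illustrated with a hypothetical value class (names invented, not the actual RowIndexEntry code): an exact-class check in `equals` compiles down to a cheap comparison of the objects' class pointers, and a subclass that calls `super.equals()` doesn't need to repeat those checks.

```java
// Hypothetical value class illustrating the equals() styles discussed above.
class Entry
{
    final long position;

    Entry(long position) { this.position = position; }

    // Exact-class match: getClass() == o.getClass() is a cheap comparison of
    // the objects' class pointers, and it keeps equals() symmetric when a
    // subclass adds state of its own.
    @Override
    public boolean equals(Object o)
    {
        if (this == o) return true;
        if (o == null || getClass() != o.getClass()) return false;
        return position == ((Entry) o).position;
    }

    @Override
    public int hashCode() { return Long.hashCode(position); }
}

class IndexedEntry extends Entry
{
    final int columnsIndexCount;

    IndexedEntry(long position, int columnsIndexCount)
    {
        super(position);
        this.columnsIndexCount = columnsIndexCount;
    }

    // No need to duplicate the class and position checks: super.equals()
    // already verified the exact class and the shared fields.
    @Override
    public boolean equals(Object o)
    {
        return super.equals(o) && columnsIndexCount == ((IndexedEntry) o).columnsIndexCount;
    }

    @Override
    public int hashCode() { return super.hashCode() * 31 + columnsIndexCount; }
}
```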

          aweisberg Ariel Weisberg added a comment - - edited
          • This line could just be instanceof?
          • Same here? See Benedict's comment
          • I think we may want to consider not closing streams that wrap fixed buffers. The close operation either does nothing or throws an exception, and all that is going to do is generate useless instructions and prevent optimization. Maybe as a separate ticket we should go back and look for things that do that.
          • DataInputBuffer doesn't have an efficient skip method. The skip method implementation in super classes expects that it is wrapping a channel or stream and skipping requires reading. For a fixed ByteBuffer there is no need to allocate or copy memory. MemoryInputStream has the same issue in that it could be efficient but isn't. Maybe they should have a common base class since they both wrap memory like things and can have memory (as opposed to stream or channel) specific implementations.
          • Dead code
          • [I don't think you need to incorporate the buffer in the hash code. It's wasted cycles, position is sufficient? Not sure if we are using it.|https://github.com/apache/cassandra/compare/cassandra-3.0...snazy:9738-key-cache-ohc#diff-9b961593e363ad6a553e5af1ff1663b1R552]
          • What's the difference between the ClusteringPrefix serializer in Serializers and ClusteringPrefix? Did you have to add the legacy support?

          That's my reaction to the code. Working on tests and coverage now.
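The skip point can be sketched as follows (class and method names are invented for illustration, not Cassandra's actual DataInputBuffer API): a stream over a fixed ByteBuffer can skip by moving the position, with no allocation or copying, unlike the generic read-and-discard `InputStream.skip()`.

```java
import java.nio.ByteBuffer;

// Sketch of the skip optimization suggested above for buffer-backed streams.
class BufferBackedInput
{
    private final ByteBuffer buffer;

    BufferBackedInput(ByteBuffer buffer) { this.buffer = buffer; }

    // Generic InputStream.skip() reads and discards bytes; over a fixed
    // buffer we can just bump the position, clamped to what is available.
    public long skip(long n)
    {
        int skipped = (int) Math.min(n, buffer.remaining());
        buffer.position(buffer.position() + skipped);
        return skipped;
    }

    public int read() { return buffer.hasRemaining() ? buffer.get() & 0xff : -1; }
}
```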

          snazy Robert Stupp added a comment - - edited

          Had to rebuild the branch as rebase w/ resolving conflicts would have been like opening a can of worms.
          Unit tests look good and cstar shows the known numbers.

          The old state is still there for reference here and here (squashed).

          Optimizations in the "new" version are not that sophisticated as in the "old" one for backward compatibility as implemented by CASSANDRA-10237.

          mkjellman Michael Kjellman added a comment -

          I've been making a good amount of progress on 9754. I'm not sure if I can give you an exact date at this point but I'm hoping it will be in the near future.

          snazy Robert Stupp added a comment -

          Pushed some more commits.
          Includes serialization of IndexInfo offsets.
          Adds tests for RowIndexEntry class and legacy sstables based on CASSANDRA-10236 (latter not working yet due to CASSANDRA-10237).

          Open todo: benchmarking/comparison against "old" implementation, verification of compatibility with legacy sstables.

          snazy Robert Stupp added a comment -

          index info offsets in the serialization

          Yes, we can still change the on-disk format for indexes in 3.0. It's a todo for this ticket.

          Caching the last used IndexInfo is going to force a little bit of promotion

          Did that because some places in the code regularly access the "current" IndexInfo. Feels cheaper than having to deserialize the whole object multiple times.

          operate on the IndexInfo as a ByteBuffer

          Would be beneficial (could save a couple of memory copies per IndexInfo) - but as you said it's not easy. I gave it a quick try but the ref-counting watchdog complained about unreleased references. I think we still save a lot with the serialized index-offsets and only deserializing the really needed objects.

          Reference counting shouldn't be too bad

          Seems doable (together with IndexInfo as a ByteBuffer). But I'd like to defer this to a follow-up ticket.

          I think we can get this working for 3.0 - so saving temporary garbage on the heap for cached and non-cached RIE.
          As I said in 9754, I think that RIE+II (including key cache) has probably reached its EOL with all the ongoing effort to optimize/replace the structure and algorithms.

          aweisberg Ariel Weisberg added a comment -

          I agree there is overlap. The scope of what we need to make the key cache work is smaller, but it might be game over for the key cache and large partitions anyway, even with Robert's improvements. Constantly loading and unloading large partitions is never going to work. We are at least reaching parity with the old key cache.

          If we could map the entire partition index then you could point this code at the mapping and it would work. It would not be ideal because what we are doing is a basic translation of the object graph which is an array pointing to variable size objects. Binary searches are going to result in access to two pages for each comparison. Possibly still better than what we have today?

          I think the trick right now is figuring out what we can do for 3.0, and what the next intermediate step is. If Robert goes and makes this work off-heap then the key cache can maybe go off heap for 3.0. Seems like removing the key cache and reliably operating against a memory map is unlikely for 3.0, but maybe shortly after.

          What would the timeline be for 9754? That kind of determines what intermediate steps make sense.

          kohlisankalp sankalp kohli added a comment -

          It looks like there is lot of overlap with CASSANDRA-9754

          aweisberg Ariel Weisberg added a comment -

          OK, looks like what I was thinking at a high level. Since it is in progress I won't go deeper yet. I would include the index info offsets in the serialization and just populate them from the start so random access is simple. It's not going to be a worse memory impact than the existing Java object graph that has the array. Caching the last used IndexInfo is going to force a little bit of promotion. Not sure how much.

          It's not going to be easy, but if we can operate on the IndexInfo as a ByteBuffer we can avoid deserializing them during the binary search. Getting ClusteringComparator and ClusteringPrefix to that point does look hard. We can measure as is and see if it generates enough garbage to matter.
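A toy illustration of that idea (layout and names invented for the example; the real IndexInfo format differs): entries serialized back-to-back in one ByteBuffer with an offsets table up front, binary-searched by peeking at a key field in place instead of materializing each candidate.

```java
import java.nio.ByteBuffer;

// Toy serialized index: [entryCount][offset0..offsetN-1][entries...],
// where each entry starts with a long key. Searches compare keys in
// place, without deserializing any entry.
class SerializedIndex
{
    private final ByteBuffer buf;
    private final int count;

    SerializedIndex(ByteBuffer buf)
    {
        this.buf = buf;
        this.count = buf.getInt(0);
    }

    private int entryOffset(int i) { return buf.getInt(4 + 4 * i); }

    // Peek at the key field of entry i without materializing it.
    private long keyAt(int i) { return buf.getLong(entryOffset(i)); }

    /** Index of the entry with the given key, or -1 if absent. */
    int search(long key)
    {
        int lo = 0, hi = count - 1;
        while (lo <= hi)
        {
            int mid = (lo + hi) >>> 1;
            long k = keyAt(mid);
            if (k < key) lo = mid + 1;
            else if (k > key) hi = mid - 1;
            else return mid;
        }
        return -1;
    }

    static SerializedIndex of(long... keys)
    {
        ByteBuffer b = ByteBuffer.allocate(4 + 4 * keys.length + 8 * keys.length);
        b.putInt(0, keys.length);
        int entryBase = 4 + 4 * keys.length;
        for (int i = 0; i < keys.length; i++)
        {
            b.putInt(4 + 4 * i, entryBase + 8 * i);
            b.putLong(entryBase + 8 * i, keys[i]);
        }
        return new SerializedIndex(b);
    }
}
```

The hard part in practice is that real comparisons go through ClusteringComparator on variable-size clustering prefixes rather than a fixed-width long.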

          Reference counting shouldn't be too bad. In all the cases I could find the index info is consumed immediately and discarded or composed into an iterator implementing Closeable. There is a bit of rope to hang ourselves with there: long-lived iterators pinning stuff in the cache. This is probably something we should test in OHC.

          snazy Robert Stupp added a comment - - edited

          Code's been refactored and unit tests have no failures, but some dtests failed recently - so it's not yet fully done.

          There are several commits on the branch.
          It starts with cleanup refactorings and preparation by encapsulating all RowIndexEntry and IndexInfo accesses.
          Next is introduction of ByteBuffer (744dde329a084a7fdab52fbbed11a15b296bab35), which is the "main" commit.
          Following commits are optimizations on top of that commit.

          The main change is that all IndexInfo objects are just present as a single (on-heap) ByteBuffer. IndexInfo objects are created when needed. This helps to reduce the amount of small, temporary and maybe unused objects - e.g. during binary search.

          I've got no benchmark in place yet, and the latest cstar tests show neither anything new nor a regression. I just didn't succeed in getting really big partitions - so I'll try to build some artificial microbenchmarks to get comparable numbers and prove that the approach also works for really big partitions (at least in theory it should, but who knows).

          The plan to do binary search in off-heap key cache or deserialize IndexInfo directly from off-heap or even just retrieve requested information without deserialization at all is not yet done. It requires more refactoring than I expected as RowIndexEntry and IndexInfo objects are used at many distinct places and "exposing" the off-heap ByteBuffer containing the serialized RowIndexEntry+IndexInfos requires proper reference counting (or risk off-heap memory leaks).
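The reference counting mentioned above can be sketched minimally (names invented; the `free` hook stands in for releasing the native memory backing the buffer):

```java
import java.util.concurrent.atomic.AtomicInteger;

// Minimal reference counting for a shared off-heap buffer: the memory is
// freed exactly once, when the last reference is released.
class RefCounted
{
    private final AtomicInteger refs = new AtomicInteger(1);
    private final Runnable free;

    RefCounted(Runnable free) { this.free = free; }

    RefCounted retain()
    {
        int prev = refs.getAndIncrement();
        if (prev <= 0)
        {
            refs.getAndDecrement();
            throw new IllegalStateException("already released");
        }
        return this;
    }

    void release()
    {
        int now = refs.decrementAndGet();
        if (now == 0)
            free.run();        // last reference gone: safe to free the memory
        else if (now < 0)
            throw new IllegalStateException("double release");
    }

    int refCount() { return refs.get(); }
}
```

Every consumer that holds on to the exposed ByteBuffer would retain() it, and forgetting a release() is exactly the off-heap leak the comment warns about.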

          EDIT: dtest build #53 shows a regression in compaction throughput, which definitely needs to be fixed - but no functional failures as far as I can see.

          snazy Robert Stupp added a comment -

          If you can do it in time it's fine for 3.0.

          K, will try

          aweisberg Ariel Weisberg added a comment -

          I haven't looked too deep into it yet, but why would it be out for 3.0? Is it not possible to bridge an off-heap list with the on-heap list implementation that is already there?

          The current formulation is out unless a benchmark of many rows per partition demonstrates no slow down compared to the POJO version on a read intensive benchmark with a cacheable distribution (not uniform).

          If you can do it in time it's fine for 3.0.

          snazy Robert Stupp added a comment -

          yea - the idea's bad. It locks out big partitions (that really need the key cache). Assume that idea moved to /dev/null.

          mkjellman Michael Kjellman added a comment -

          Robert Stupp I had floated the idea of a size limit with Aleksey Yeschenko and he wasn't a fan

          snazy Robert Stupp added a comment -

          It's just brain dumping an idea: we could change RowIndexEntry and related IndexInfo plus its comparators to be "off heap compatible" (means work on a ByteBuffer but not change the serialization format or algorithm itself). Having this would also save polluting the heap with RIE+II objects when reading it from disk as it could just load one ByteBuffer and operate on that.
          In other words: make RowIndexEntry a wrapper for a ByteBuffer without the need to deserialize any object structure during reads.

          CASSANDRA-9754 could further optimize this to handle really big partitions?

          But having this problem with big RowIndexEntries (lots of IndexInfo objects) I guess migrating KC off-heap is out of 3.0?

          snazy Robert Stupp added a comment -

          As an intermediate solution (it's actually more a workaround) we could just not add key-cache entries bigger than a configurable size (maybe 64kB?) and find a solution that works for both key-cache and disk (CASSANDRA-9754). Having something that can do the search in a ByteBuffer might be nice as it could be used in off-heap and on disk (assuming it's mmapped). /cc Michael Kjellman
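The workaround floated here (and rejected further down the thread) amounts to a size cap on cached values; a sketch, with invented names rather than Cassandra's actual cache API:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Sketch of the size-cap workaround: serialized entries larger than a
// configurable threshold are simply not cached.
class SizeCappedCache<K>
{
    private final Map<K, byte[]> map = new ConcurrentHashMap<>();
    private final int maxValueSize;

    SizeCappedCache(int maxValueSize) { this.maxValueSize = maxValueSize; }

    /** @return true if the value was cached, false if it exceeded the cap */
    boolean put(K key, byte[] serialized)
    {
        if (serialized.length > maxValueSize)
            return false;              // big partitions bypass the cache entirely
        map.put(key, serialized);
        return true;
    }

    byte[] get(K key) { return map.get(key); }
}
```

The drawback called out below is visible right in the sketch: the partitions most expensive to re-index are exactly the ones the cap excludes.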

          kohlisankalp sankalp kohli added a comment -

          Michael Kjellman is cooking something like this in CASSANDRA-9754. I will let him give more details.

          aweisberg Ariel Weisberg added a comment -

          For observers. We hit a snag. The key cache values can be quite large. There is an entry in each value for every row per partition so it can be in the thousands. This likely means that copying the entire thing on heap to operate on it once per read is not going to match the performance of the existing POJO implementation. Robert is going to benchmark a more representative configuration.

          It's a tractable problem but we will need an off-heap list implementation for variable size objects that supports binary search without materializing each entry in the search.

          aweisberg Ariel Weisberg added a comment - - edited

          How big is the realistic size range of key cache values?

          The only thing that makes me uncomfortable right now is the buffer handling for serializing the key cache value in serializedSize(). Resizing in the loop is 4k at a time and not doubling. I think it at least needs to double to manage whatever the worst case is.

          It seems to me like the right tack on failure is to allocate the correct size buffer by calculating the size the old way or double (for lg). Doubling is fine if we really never expect sane data models to hit it. Doubling is not going to be great for people to hit under real world conditions especially if they have to do it several times.
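The growth strategies being compared can be shown in a few lines (an illustrative helper, not the actual serialization buffer code): growing in fixed 4 kB steps means O(n²) total copying for a large value, while doubling amortizes to O(n).

```java
import java.nio.ByteBuffer;

// Illustration of grow-by-doubling, as opposed to adding a fixed 4096
// bytes per resize: far fewer reallocations for large values.
class GrowableBuffer
{
    static ByteBuffer ensureCapacity(ByteBuffer buf, int needed)
    {
        if (buf.capacity() >= needed)
            return buf;
        int cap = Math.max(buf.capacity(), 1);
        while (cap < needed)
            cap *= 2;                  // double instead of cap += 4096
        ByteBuffer bigger = ByteBuffer.allocate(cap);
        buf.flip();                    // copy everything written so far
        bigger.put(buf);
        return bigger;
    }
}
```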

          It's tempting to ask to have 10189 expanded to be "make this stuff fast" and then do it the old way in this change set, but I could be convinced if there isn't a really bad corner case.

          Also next time could you wait for +1 to squash? I can't see your last few change sets once you squash. I am guilty of this too and now I know how it feels.

          snazy Robert Stupp added a comment -

          As discussed offline:

          • current branch uses its own string (de)serialization, is squashed and rebased
          • opened CASSANDRA-10189 as a follow-up for unified read/writeUTF code paths
          aweisberg Ariel Weisberg added a comment -

          For reading a string there is no reason not to use the last bit and allow strings up to 65k. It also doesn't throw if the length is too large.

          I really would like to get to the one true string reading/writing path if possible. So we can put together one good implementation and also to optimize icache. To allocate or pool temporary buffers is kind of a question of how allocation bound you are. I think the answer for C* right now is very.
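The unsigned-short length prefix being discussed looks like this in miniature (invented names, not Cassandra's actual ByteBufferUtil code): using all 16 bits of the prefix allows encoded strings up to 65535 bytes, twice what a signed short would permit.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

// Length-prefixed string (de)serialization with an unsigned short prefix.
class ShortStrings
{
    static void writeUTF(ByteBuffer out, String s)
    {
        byte[] bytes = s.getBytes(StandardCharsets.UTF_8);
        if (bytes.length > 0xFFFF)
            throw new IllegalArgumentException("encoded length " + bytes.length + " > 65535");
        out.putShort((short) bytes.length);   // unsigned short on the wire
        out.put(bytes);
    }

    static String readUTF(ByteBuffer in)
    {
        int len = in.getShort() & 0xFFFF;     // mask back to unsigned
        byte[] bytes = new byte[len];
        in.get(bytes);
        return new String(bytes, StandardCharsets.UTF_8);
    }
}
```

(Note that `java.io.DataOutput.writeUTF` uses modified UTF-8 rather than standard UTF-8, which is one reason a unified C* code path is attractive.)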

          Is it feasible to share the same pooling buffer for both the key serialization and value serialization? Maybe a little finicky given that size is sometimes retrieved in an order that is different from the order they are serialized in.

          snazy Robert Stupp added a comment -

          Branch is ready for another review round.
          It's rebased against current 3.0 and trunk.

          Changes are:

          • updated to ohc 0.4.2 (to get rid of ByteBuffer.order() calls and the try-finally-clauses)
          • no more duplicate serialization effort in serializedSize() + serialize() - uses a thread-local serialization buffer. Can be further improved with CASSANDRA-9929.
          • added/moved string serialization methods to ByteBufferUtil, which has distinct code paths for 7-bit ASCII and strings with wider chars
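
          The thread-local buffer trick can be sketched roughly like this (hypothetical names and a toy serializer, not the actual patch code):

```java
import java.io.ByteArrayOutputStream;

public class SerializeOnce {
    // one reusable buffer per thread: serializedSize() + serialize() no longer
    // need two serialization passes, and no per-call allocation happens
    private static final ThreadLocal<ByteArrayOutputStream> BUFFER =
            ThreadLocal.withInitial(() -> new ByteArrayOutputStream(4096));

    static byte[] serialize(String key) {
        ByteArrayOutputStream buf = BUFFER.get();
        buf.reset(); // reuse the backing array instead of allocating a fresh one
        for (int i = 0; i < key.length(); i++)
            buf.write(key.charAt(i) & 0x7f); // toy 7-bit encoding, stands in for the real serializer
        return buf.toByteArray(); // size and bytes both come out of a single pass
    }

    public static void main(String[] args) {
        System.out.println(serialize("abc").length); // prints 3
    }
}
```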
          snazy Robert Stupp added a comment -

          Ouch, yes, names can contain UTF8 chars.
          Pushed a commit as a demo of how the lambda approach could look like. writeUTF could be nice - but readUTF would require two method refs (one for reading the len and one for reading bytes).
          read/writeUTF in java.io also use unsigned short for serialization.
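
          For reference, java.io's behaviour can be checked directly: writeUTF prefixes the modified-UTF-8 bytes with an unsigned 16-bit length, and throws once the encoded form exceeds 65535 bytes:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UTFDataFormatException;

public class WriteUtfLimit {
    public static void main(String[] args) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        new DataOutputStream(bytes).writeUTF("abc");
        System.out.println(bytes.size()); // prints 5: 2-byte length prefix + 3 payload bytes

        DataInputStream in = new DataInputStream(new ByteArrayInputStream(bytes.toByteArray()));
        System.out.println(in.readUTF()); // prints abc

        try {
            // 70000 one-byte chars exceed the unsigned-short length prefix
            new DataOutputStream(new ByteArrayOutputStream())
                    .writeUTF(new String(new char[70000]).replace('\0', 'x'));
        } catch (UTFDataFormatException e) {
            System.out.println("too long"); // prints too long
        }
    }
}
```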

          Hm - Mrs. cassci seems to be annoyed... ("Slave went offline during the build")

          aweisberg Ariel Weisberg added a comment -

          In OHCKeyCache, when calculating the length of strings, you are using String.length(), which returns a wrong answer if you encode using UTF-8.
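          For example (the table name here is made up), a string with non-Latin characters has more UTF-8 bytes than UTF-16 code units:

```java
import java.nio.charset.StandardCharsets;

public class Utf8Len {
    public static void main(String[] args) {
        String name = "tbl_κλειδί"; // hypothetical table name with Greek characters
        System.out.println(name.length());                                // 10 UTF-16 code units
        System.out.println(name.getBytes(StandardCharsets.UTF_8).length); // 16 encoded bytes
    }
}
```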

          We went to some length to come up with an "optimal" no-garbage string writing function for BufferedDataOutputStream (you can see it as a static method in UnbufferedDataOutputStream). It would be great if we could do the same thing here and not allocate byte arrays for the encoded names. Would it be crazy for it to take a lambda that describes how to write the generated byte array to some output? Then we could use the same code everywhere. Benedict, what do you think?

          Since you are using a short length prefix what is your level of confidence it will always be enough? How does it fail if it isn't?

          The 2i path testing is much appreciated. The utests didn't seem to make it into cassci.

          snazy Robert Stupp added a comment -

          Pushed changes to existing branches (plus a new branch for merge to trunk). Included is a new KeyCacheCqlTest to test 2i and large partitions (verified via key cache metrics) and addressing the review comments (in a separate commit).

          snazy Robert Stupp added a comment -

          Thanks

          How are the 2i paths tested?

          Currently implicitly via other unit tests. But I'll add some unit test for that.

          The null case in makeVal isn't tested, maybe not that interesting

          Oh. Although it doesn't cause any problems, passing null to OHC is not allowed. I've added an assertion.

          SerializationHeader forKeyCache

          Yes. It's intentionally racy to prevent blocking operations. The chance of such a race is probably low, and it "heals" itself. Added a comment for that.

          comment about singleton weigher

          removed

          NIODataInputStream has a derived class DataInputBuffer

          changed to use DIB

          string encoding and decoding helpers

          yup - makes sense. refactored.

          An enhancement we can file for later is to replace those strings with vints that reference a map of possible table names.

          My idea is to remove strings entirely from the key cache. Keyspace + CF names can be handled by CASSANDRA-10028. Not sure how to handle file/path names - maybe using some sparse list structure for sstable generations (in another ticket).

          Haven't pushed anything yet - but will update my branch soon.

          aweisberg Ariel Weisberg added a comment -

          Good stuff. Coverage of the OHC key cache provider looks good.

          • How are the 2i paths tested?
          • The null case in makeVal isn't tested, maybe not that interesting
          • SerializationHeader forKeyCache is racy and can result in an undersized array clobbering a properly sized one. But... it doesn't retrieve the value it sets, so odds are it will eventually work out to be the longer one. It works; it's just intentionally racy.
          • In CacheService does that comment about singleton weigher even make sense anymore?
          • NIODataInputStream has a derived class DataInputBuffer that exposes the constructor you made public.
          • The string encoding and decoding helpers you wrote seem like they should be factored out somewhere else, maybe ByteBufferUtil? Also you don't specify a string encoding and there may be some issues with serialized size of non-latin characters lurking as well.
          • An enhancement we can file for later is to replace those strings with vints that reference a map of possible table names. For persistence definitely fully qualify, but in memory we can store more entries that way.
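
          The racy pattern described in the third bullet can be sketched like this (hypothetical names, not the actual SerializationHeader code): a lost update only means a later caller rebuilds the larger array, so the race is benign.

```java
import java.util.Arrays;

public class RacyGrowCache {
    // volatile gives visibility; updates are deliberately non-atomic
    private static volatile String[] cache = new String[0];

    static String[] ensureSize(int size) {
        String[] cur = cache;
        if (cur.length >= size)
            return cur;
        String[] bigger = Arrays.copyOf(cur, size);
        cache = bigger; // racy plain write: a concurrent smaller array may clobber
                        // this one, but the next call simply re-creates it ("heals")
        return bigger;
    }

    public static void main(String[] args) {
        System.out.println(ensureSize(4).length); // prints 4
        System.out.println(ensureSize(2).length); // still 4: already big enough
    }
}
```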
          snazy Robert Stupp added a comment -

          Code coverage doesn't look bad so far. The important code is covered (by nature) by nearly every unit test.
          The getters/setters on CacheService for all caches are uncovered.
          The key cache's hotKeyIterator() is also uncovered.
          Will add unit tests for them in the respective test classes.

          iamaleksey Aleksey Yeschenko added a comment -

          Ariel Weisberg Could you review?

          jbellis Jonathan Ellis added a comment -

          What does code coverage show? Because I know that would be Ariel's first question.

          snazy Robert Stupp added a comment -

          I’d like to propose this patch to be included in 3.0. I hope the cstar tests are sufficient but otherwise I can deliver more with different workloads.

          cstar tests

          All cstar tests mentioned below perform three operations: write-only, mixed and read-only.
          Unfortunately, cassandra-stress seems to limit the achievable write throughput for workloads with clustering keys.

          All tests on this patch show reduced GC pressure (for reads, of course).
          That gives G1 more headroom to operate and often gains about 10-15% read perf improvement depending on the hardware (in this case bdplab vs. blade_11_b) - bdplab (spinning disks, less RAM) shows the bigger improvement.

          one big clustering key

          user, native, cql3, user profile

          blade_11_b
          bdplab

          big clustering over two clustering columns

          user, native, cql3, user profile

          blade_11_b
          bdplab

          big clustering over two clustering columns, reduced threads for pure-write and mixed operations

          user, native, cql3, user profile

          blade_11_b
          bdplab

          stress write, mixed, read

          blade_11_b
          bdplab

          Git branch + cassci

          git branch
          unit tests
          dtests

          I didn’t see any failed tests related to this patch.

          There is another branch on github as well which contains optimizations not purely related to key-cache. 9738-key-cache-ohc is based on that branch and contains:

          • “singletons” for key-cache o.a.c.db.SerializationHeader instances (dynamically extended, if required)
          • “singletons” for IndexInfo.Serializer in o.a.c.db.Serializers (dynamically extended, if required)
          • “singletons” for BigVersion instances for ma, la, ka, jb - other versions get temporary objects (some tests use older sstable versions)

          Further optimisations

          There are some things that can be optimised in the future:

          • Currently we need to serialise the keyspace and cf names and the cfId. This is necessary since the cfId of secondary indexes is inherited from the base table. If all tables and all secondary indexes have unique IDs, we can omit KS and CF name serialisation (and its weird cfName.contains('.') 2i detection). Can be built with or after the 2i API redesign.
          • The full directory path is serialised. This appears to be less expensive than iterating over the whole list of sstables and identifying an sstable by its generation.
          • As Benedict suggested, we can switch to very tiny key-cache entries and also omit serialisation of IndexInfo.
          snazy Robert Stupp added a comment -

          sankalp kohli ok

          Here are two more cstar results using this profile.
          blade_11_b shows OHC having the same throughput as 2.2 but with fewer GC pauses; both are faster than 3.0 alone.
          bdplab shows OHC having better throughput than 2.2 and 3.0; also less GC effort.
          Both environments show this behaviour in every test - I guess it's ok to blame it on spinning disks and less RAM.

          kohlisankalp sankalp kohli added a comment -

          Yes I am +1 on the change. We should test this with different partition sizes.

          snazy Robert Stupp added a comment -

          sankalp kohli all previous tests were "plain old simple stress" profiles - so no big partitions. The current and next cstar tests have different stress profiles with different partition sizes. I appreciate suggestions for real-world stress profiles.
          But even if this gives no throughput benefit, reduced GC pressure (or not letting objects "escape" from new-gen) is worth the change.

          kohlisankalp sankalp kohli added a comment -

          Robert Stupp In your test, how big are the CQL partitions and how many IndexInfo objects are being generated? That will determine the improvement we will see.

          Also since we need to deserialize these objects on key cache hit, I am not sure how it will affect large CQL partitions.

          snazy Robert Stupp added a comment -

          FTR: I've added a chunked implementation to OHC, but it is not included in this ticket. A thorough review of that chunked implementation would be required. The chunked impl has the advantage of reducing the malloc/free rate to nearly 0, at the cost that the key-cache cannot be resized dynamically (at least not yet).

          benedict Benedict added a comment -

          Yes. Thanks

          jbellis Jonathan Ellis added a comment -

          Did you mean to link a different issue?

          benedict Benedict added a comment - - edited

          Looks like improved occupancy is also a factor (comparing run 2 to run 3). I'd prefer to eliminate the malloc/free from cache maintenance before we hook OHC into the key cache, but these numbers are pretty compelling and we should probably consider it for 3.0.

          I suspect CASSANDRA-8931 will permit us to deliver a more efficient and high occupancy key-cache, but that shouldn't prevent us doing something in the meantime.

          snazy Robert Stupp added a comment - - edited

          This cstar test is probably interesting. It compares 2.2, trunk, trunk w/ ParNewCMS, 9738, and 9738 w/ ParNewCMS. The reason for the additional ParNewCMS configs is that it was the only reasonable change that could explain the weird throughput graphs (and huge GC pauses) for trunk.
          After some offline discussion yesterday with Benedict and T Jake Luciani comparing G1 w/ ParNewCMS, we concluded that G1 is "the cause" of the weird graphs + long GC pauses for trunk. G1 requires more headroom to perform better, and 9738 gives it more headroom (by moving stuff from heap to off-heap).

          jbellis Jonathan Ellis added a comment -

          My next question was going to be why GC activity went up but I see you edited the table.

          Very promising!

          snazy Robert Stupp added a comment -

          why would the cache improvements improve write speed?

          TBH - I was surprised by that, too. I guess it's the key cache invalidation during compaction (SSTableRewriter.InvalidateKeys).

          jbellis Jonathan Ellis added a comment -

          That definitely looks interesting. But why would the cache improvements improve write speed?

          snazy Robert Stupp added a comment - - edited

          I did some coding just to see what this might buy us.
          cstar bench gives these nice numbers:

                                    reads/sec  writes/sec  total gc time, 2nd read
          trunk                     120k       122k        23sec (104 gcs, 222ms avg)
          with CASSANDRA-9718       128k       131k        24sec (106 gcs, 222ms avg)
          with key-cache off-heap   163k       136k        2sec  (111 gcs, 18ms avg)

          EDIT: corrected accidentally misplaced GC info


            People

            • Assignee:
              snazy Robert Stupp
              Reporter:
              snazy Robert Stupp
            • Votes:
              0 Vote for this issue
              Watchers:
              14 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development