Details

    • Type: Bug Bug
    • Status: Resolved
    • Priority: Major Major
    • Resolution: Fixed
    • Fix Version/s: 2.0 beta 1
    • Component/s: Core
    • Labels:
      None

      Description

      IndexSummary can still use a lot of heap for narrow-row sstables. (It can also contribute to memory fragmentation because of the large arrays it creates.)

        Activity

        Hide
        Vijay added a comment -

        Pushed the changes to https://github.com/Vijay2win/cassandra/commits/5521

        The idea is as mentioned in CASSANDRA-5506 comments, Since this ticket is marked for 2.0 i took the liberty of changing the index summary file format.
        I am going to spend sometime on testing the performance difference (If any). Thanks!

        Show
        Vijay added a comment - Pushed the changes to https://github.com/Vijay2win/cassandra/commits/5521 The idea is as mentioned in CASSANDRA-5506 comments, Since this ticket is marked for 2.0 i took the liberty of changing the index summary file format. I am going to spend sometime on testing the performance difference (If any). Thanks!
        Hide
        Jonathan Ellis added a comment -

        Seems inefficient to make the "get pair" the unit of fetching memory, since usually you want the key or the position but not both. (You wouldn't have to change the signature of getKey either, if you just fetched the key into a byte[] directly instead of wrapping half of it.)

        What's the upgrade path? Would prefer "automatically rebuilds old summaries" to "user has to manually blow away summaries or it dies trying to start."

        Show
        Jonathan Ellis added a comment - Seems inefficient to make the "get pair" the unit of fetching memory, since usually you want the key or the position but not both. (You wouldn't have to change the signature of getKey either, if you just fetched the key into a byte[] directly instead of wrapping half of it.) What's the upgrade path? Would prefer "automatically rebuilds old summaries" to "user has to manually blow away summaries or it dies trying to start."
        Hide
        Vijay added a comment -

        Seems inefficient to make the "get pair" the unit of fetching memory,

        Ahaa good point will fix it.

        What's the upgrade path? Would prefer "automatically rebuilds old summaries" to "user has to manually blow away summaries or it dies trying to start."

        It is automatic, but until the user runs scrub or until the new SST's are created the startup will be slow.

        Show
        Vijay added a comment - Seems inefficient to make the "get pair" the unit of fetching memory, Ahaa good point will fix it. What's the upgrade path? Would prefer "automatically rebuilds old summaries" to "user has to manually blow away summaries or it dies trying to start." It is automatic, but until the user runs scrub or until the new SST's are created the startup will be slow.
        Hide
        Vijay added a comment -

        Pushed an update to https://github.com/Vijay2win/cassandra/commits/5521_v2

        This update doesn't use int[] which v1 used, v2 uses kind of offheap index over index summary which is stored at the beginning of the memory block... Moves index summary completely offheap...

        I was not able to see any difference in performance between trunk and v2 using stress tool.
        Micro benchmark shows v2 is 6 seconds slower for 20 Bilion get's hence it is was not noticeable in stress tool, but the real benefit is less pauses and more data a node can hold...

        Show
        Vijay added a comment - Pushed an update to https://github.com/Vijay2win/cassandra/commits/5521_v2 This update doesn't use int[] which v1 used, v2 uses kind of offheap index over index summary which is stored at the beginning of the memory block... Moves index summary completely offheap... I was not able to see any difference in performance between trunk and v2 using stress tool. Micro benchmark shows v2 is 6 seconds slower for 20 Bilion get's hence it is was not noticeable in stress tool, but the real benefit is less pauses and more data a node can hold...
        Hide
        Pavel Yaskevich added a comment - - edited

        Have we considered using vint encoding on those arrays as we keep them in memory anyway to minimize space consumption?

        Edit: i remember now why that is not a good idea I wonder though how what could memory footprint be if we use TreeMap inside and keys and offsets (in vint encoding) saved in native memory...

        Show
        Pavel Yaskevich added a comment - - edited Have we considered using vint encoding on those arrays as we keep them in memory anyway to minimize space consumption? Edit: i remember now why that is not a good idea I wonder though how what could memory footprint be if we use TreeMap inside and keys and offsets (in vint encoding) saved in native memory...
        Hide
        Pavel Yaskevich added a comment -

        I see v2 does byte[] allocation on every getKey(int) call which would be happening very frequently due to binary search which happens on very index lookup. So I don't think there is any real benefit in terms of GC friendliness from moving off-heap in this case as we have to copy data over multiple times anyway.

        As an alternative to Unsafe we can try hybrid approach - identify if JNA is present and put summaries off-heap (using JNA's Memory) in combination with Pointer.getByteBuffer() which doesn't copy any data around but instead creates direct ByteBuffer, otherwise have IndexSummary on-heap but split byte[][] and long[] into pages so we don't have to allocate contiguous space for big SSTables which would be much GC friendlier.

        Show
        Pavel Yaskevich added a comment - I see v2 does byte[] allocation on every getKey(int) call which would be happening very frequently due to binary search which happens on very index lookup. So I don't think there is any real benefit in terms of GC friendliness from moving off-heap in this case as we have to copy data over multiple times anyway. As an alternative to Unsafe we can try hybrid approach - identify if JNA is present and put summaries off-heap (using JNA's Memory) in combination with Pointer.getByteBuffer() which doesn't copy any data around but instead creates direct ByteBuffer, otherwise have IndexSummary on-heap but split byte[][] and long[] into pages so we don't have to allocate contiguous space for big SSTables which would be much GC friendlier.
        Hide
        Jonathan Ellis added a comment -

        As Pavel notes, the most important use of getKey is the one in binarySearch. But here we only care about the comparison, we don't actually need the artifact of a ByteBuffer. So why not compare directly without creating a buffer first? No buffer at all is even cheaper than a native buffer.

        (This would also mean that we only need to look at as many bytes as it takes before the first difference is found.)

        Show
        Jonathan Ellis added a comment - As Pavel notes, the most important use of getKey is the one in binarySearch. But here we only care about the comparison, we don't actually need the artifact of a ByteBuffer. So why not compare directly without creating a buffer first? No buffer at all is even cheaper than a native buffer. (This would also mean that we only need to look at as many bytes as it takes before the first difference is found.)
        Hide
        Jonathan Ellis added a comment -

        So why not compare directly without creating a buffer first?

        That doesn't quite work, since we need to compare the Token, not the key itself.

        So, revised suggestion: allow the partitioner to decorate a Memory location, instead of forcing us to create a ByteBuffer first.

        Show
        Jonathan Ellis added a comment - So why not compare directly without creating a buffer first? That doesn't quite work, since we need to compare the Token, not the key itself. So, revised suggestion: allow the partitioner to decorate a Memory location, instead of forcing us to create a ByteBuffer first.
        Hide
        Pavel Yaskevich added a comment -

        if we do so we would have to change partitioner interface to decorateKey and getToken, DecoratedKey to have two "key" fields and change at least MurmurHash to accept both BB and M at the same time. I'm my opinion it's just too many changes just because we don't require JNA but with hybrid approach we don't have to do any of that work. Besides mentioned users would want to run with JNA in production anyway.

        Show
        Pavel Yaskevich added a comment - if we do so we would have to change partitioner interface to decorateKey and getToken, DecoratedKey to have two "key" fields and change at least MurmurHash to accept both BB and M at the same time. I'm my opinion it's just too many changes just because we don't require JNA but with hybrid approach we don't have to do any of that work. Besides mentioned users would want to run with JNA in production anyway.
        Hide
        Jonathan Ellis added a comment -

        Just add a IPartitioner.compareToken method that does what we need, then. Much better than creating extra objects that we don't care about.

        I like where we are now, with JNA being mostly optional (more so in 2.0 where we require Java7, so we don't need JNA for snapshot). Remember, we don't support JNA at all on Windows. I'd rather use less JNA than more.

        Show
        Jonathan Ellis added a comment - Just add a IPartitioner.compareToken method that does what we need, then. Much better than creating extra objects that we don't care about. I like where we are now, with JNA being mostly optional (more so in 2.0 where we require Java7, so we don't need JNA for snapshot). Remember, we don't support JNA at all on Windows. I'd rather use less JNA than more.
        Hide
        Pavel Yaskevich added a comment -

        But don't you still have to generate token from Memory and make changes to getToken(BB) and underlying methods or was is proposed interface for compareToken?

        I like where we are now, with JNA being mostly optional (more so in 2.0 where we require Java7, so we don't need JNA for snapshot). Remember, we don't support JNA at all on Windows. I'd rather use less JNA than more.

        I just to clarify, the only thing we need from JNA in this case is Pointer.getByteBuffer() which is actually a JNI method but unfortunately Unsafe doesn't have it somehow

        I agree tho that we should rely less on JNA but in this case we still pay the price of memory usage even putting off-heap so in non-JNA case we make it GC/allocation friendly it still makes a good improvement overall with almost no code changes.

        Show
        Pavel Yaskevich added a comment - But don't you still have to generate token from Memory and make changes to getToken(BB) and underlying methods or was is proposed interface for compareToken? I like where we are now, with JNA being mostly optional (more so in 2.0 where we require Java7, so we don't need JNA for snapshot). Remember, we don't support JNA at all on Windows. I'd rather use less JNA than more. I just to clarify, the only thing we need from JNA in this case is Pointer.getByteBuffer() which is actually a JNI method but unfortunately Unsafe doesn't have it somehow I agree tho that we should rely less on JNA but in this case we still pay the price of memory usage even putting off-heap so in non-JNA case we make it GC/allocation friendly it still makes a good improvement overall with almost no code changes.
        Hide
        Jonathan Ellis added a comment -

        Thinking about it, getToken(Memory, offset, length) is probably the right thing to add to IPartitioner. The rest can live in IndexSummary. That doesn't sound like a huge burden to me. And as I said, creating no Buffer (or DK) objects is better than creating native ones.

        Show
        Jonathan Ellis added a comment - Thinking about it, getToken(Memory, offset, length) is probably the right thing to add to IPartitioner. The rest can live in IndexSummary. That doesn't sound like a huge burden to me. And as I said, creating no Buffer (or DK) objects is better than creating native ones.
        Hide
        Pavel Yaskevich added a comment -

        So tokens that keep bb around right now would have to keep Memory and offset, size references? I'm not against this, just trying to clarify to myself if we would rather want to keep some kind of ROBuffer container for DK and Token to unify interface...

        Show
        Pavel Yaskevich added a comment - So tokens that keep bb around right now would have to keep Memory and offset, size references? I'm not against this, just trying to clarify to myself if we would rather want to keep some kind of ROBuffer container for DK and Token to unify interface...
        Hide
        Jonathan Ellis added a comment - - edited

        No, the Token wouldn't share any bytes with the key (with the exception of BOP, and I don't care about optimizing for that case), so there's no reason to not create a regular, on-heap Token.

        Edit: keeping in mind that we don't actually need a full DK object, we just need a Token.

        Show
        Jonathan Ellis added a comment - - edited No, the Token wouldn't share any bytes with the key (with the exception of BOP, and I don't care about optimizing for that case), so there's no reason to not create a regular, on-heap Token. Edit: keeping in mind that we don't actually need a full DK object, we just need a Token.
        Hide
        Vijay added a comment - - edited

        Honestly, glad to see the thread going in the same thinking process which i went though....

        Changing the Partitioner is a bigger change... but before we go there, wondering if this optimization is going to help us?
        For BB is not cheap, but it is going to be good garbage which will live and die in young generation.

        I can think of 2 other options...
        1) We can serialize and deserialize Token in IndexSummary we still need additional function to serialize and deserialize from memory (for BOP we can serialize the key/byte[], we have also removed the token calculation overhead) so we can also try and compare incrementally.
        2) We can use MMappedFile instead and get ByteBuffer (this could work in our favor, for the new SST's which is never queried there is zero overhead in memory )

        Show
        Vijay added a comment - - edited Honestly, glad to see the thread going in the same thinking process which i went though.... Changing the Partitioner is a bigger change... but before we go there, wondering if this optimization is going to help us? For BB is not cheap, but it is going to be good garbage which will live and die in young generation. I can think of 2 other options... 1) We can serialize and deserialize Token in IndexSummary we still need additional function to serialize and deserialize from memory (for BOP we can serialize the key/byte[], we have also removed the token calculation overhead) so we can also try and compare incrementally. 2) We can use MMappedFile instead and get ByteBuffer (this could work in our favor, for the new SST's which is never queried there is zero overhead in memory )
        Hide
        Pavel Yaskevich added a comment -

        For BB is not cheap, but it is going to be good garbage which will live and die in young generation.

        Indeed, those are just containers so actual data is not copied and those buffers pretty GC friendly as you mentioned. But I think that option with using Memory with token is ok if we can encapsulate it properly.

        Show
        Pavel Yaskevich added a comment - For BB is not cheap, but it is going to be good garbage which will live and die in young generation. Indeed, those are just containers so actual data is not copied and those buffers pretty GC friendly as you mentioned. But I think that option with using Memory with token is ok if we can encapsulate it properly.
        Hide
        Jonathan Ellis added a comment - - edited

        Changing the Partitioner is a bigger change

        It does get ugly since you'd need to reimplement Murmur3.hash3_x64_128 on Memory objects. (Not for the first time, I'm pissed that ByteBuffer isn't an interface...)

        Let's go ahead and move forward with v2 and optimize later if we need to.

        Nits:

        1. rename hasSummaries to offHeapSummaries
        2. InputStream.read has an overload that takes a length parameter, you don't need to realloc the buffer
        3. The comment doesn't match the code here. Also, getIndex should be private.
          .       // multiply by 4 and add the block start
                  return bytes.getInt(index << 2);
          
        4. We can easily inline DK.compareTo instead of actually creating a DK object (i.e., call partitioner.getToken instead, then compare the tokens and keys without the DK wrapper)

        The rest LGTM.

        Show
        Jonathan Ellis added a comment - - edited Changing the Partitioner is a bigger change It does get ugly since you'd need to reimplement Murmur3.hash3_x64_128 on Memory objects. (Not for the first time, I'm pissed that ByteBuffer isn't an interface...) Let's go ahead and move forward with v2 and optimize later if we need to. Nits: rename hasSummaries to offHeapSummaries InputStream.read has an overload that takes a length parameter, you don't need to realloc the buffer The comment doesn't match the code here. Also, getIndex should be private. . // multiply by 4 and add the block start return bytes.getInt(index << 2); We can easily inline DK.compareTo instead of actually creating a DK object (i.e., call partitioner.getToken instead, then compare the tokens and keys without the DK wrapper) The rest LGTM.
        Hide
        Vijay added a comment -

        Committed to Trunk!

        Added following function to DK to support RowPosition

        public static int compareTo(IPartitioner partitioner, ByteBuffer key, RowPosition position)
            {
                // delegate to Token.KeyBound if needed
                if (!(position instanceof DecoratedKey))
                    return -position.compareTo(partitioner.decorateKey(key));
        
                DecoratedKey otherKey = (DecoratedKey) position;
                int cmp = partitioner.getToken(key).compareTo(otherKey.getToken());
                return cmp == 0 ? ByteBufferUtil.compareUnsigned(key, otherKey.key) : cmp;
            }
        

        Thanks!

        Show
        Vijay added a comment - Committed to Trunk! Added following function to DK to support RowPosition public static int compareTo(IPartitioner partitioner, ByteBuffer key, RowPosition position) { // delegate to Token.KeyBound if needed if (!(position instanceof DecoratedKey)) return -position.compareTo(partitioner.decorateKey(key)); DecoratedKey otherKey = (DecoratedKey) position; int cmp = partitioner.getToken(key).compareTo(otherKey.getToken()); return cmp == 0 ? ByteBufferUtil.compareUnsigned(key, otherKey.key) : cmp; } Thanks!

          People

          • Assignee:
            Vijay
            Reporter:
            Jonathan Ellis
            Reviewer:
            Jonathan Ellis
          • Votes:
            0 Vote for this issue
            Watchers:
            5 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development