Cassandra / CASSANDRA-4023

Improve BloomFilter deserialization performance

    Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Minor
    • Resolution: Fixed
    • Fix Version/s: 1.0.9, 1.1.0
    • Component/s: Core

      Description

      With the same amount of data, a 1.0.7 cluster takes about 4x longer to start up than a 0.8.7 cluster.

      It seems that 1.0.7 loads the BloomFilter by reading longs out one at a time in a multithreaded process, while 0.8.7 reads the entire object in one go.

      Perhaps we should update the new BloomFilter to read in batches as well?

      1. trunk-4023.txt
        4 kB
        Yuki Morishita
      2. cassandra-1.0-4023-v3.txt
        6 kB
        Yuki Morishita
      3. cassandra-1.0-4023-v2.txt
        3 kB
        Yuki Morishita
      4. 4023.txt
        2 kB
        Jonathan Ellis

        Activity

        Brandon Williams added a comment -

        The theory here is that the multithreading is causing seek contention when loading the sstables.

        Jonathan Ellis added a comment -

        We added the multithreading specifically because it improves startup time for people with multiple spindles or SSDs...

        Any ideas how to get the best of both worlds besides falling back to a config option?

        (Maybe it's time to add random vs sequential speed ratio as a setting, which at least is general enough to be useful in other places.)

        Brandon Williams added a comment -

        (Maybe it's time to add random vs sequential speed ratio as a setting, which at least is general enough to be useful in other places.)

        This sounds like a good idea; we're never going to strike a sufficient balance between SSD and rotational media without a knob to turn.

        Michael Harris added a comment - edited

        My $0.02 is that it may be helpful to batch reads. I'm not sure whether the underlying stream used to read the bloom filters reads a large chunk and caches it, but if not, it could help to read 64K or 1M or whatever seems appropriate (maybe configurable?) into a buffer and grab the longs out of that, instead of just calling ois.readLong(). This doesn't completely fix the problem of disk contention, but it might cause larger sequential reads to be submitted to the disk, which might then behave more nicely.

        The specific example I'm thinking of here is: it looks like the deserialization of LegacyBloomFilter (perhaps what 0.8 uses?) is just an ois.readObject() for a BitSet, and that's it. Whereas for BloomFilter (the new version?), deserialization is a tight loop of readLong() calls; same with serialization, FWIW. Not that using Java serialization for LTS is necessarily a good idea, but it may be easier on the disk.
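
        A minimal sketch of the batching idea above, for illustration only: the class and method names here are hypothetical, not Cassandra's actual API, and it assumes the filter's bitset is stored as a plain sequence of big-endian longs.

        import java.io.DataInputStream;
        import java.io.IOException;
        import java.io.InputStream;
        import java.nio.ByteBuffer;

        public final class BatchedLongReader
        {
            private static final int CHUNK_BYTES = 64 * 1024; // 64K per read; could be made configurable

            // Reads `count` big-endian longs from `in` in large chunks instead of
            // issuing one readLong() call per value, so the disk sees fewer,
            // larger sequential requests.
            public static long[] readLongs(InputStream in, int count) throws IOException
            {
                DataInputStream din = new DataInputStream(in);
                long[] words = new long[count];
                byte[] chunk = new byte[CHUNK_BYTES];
                int filled = 0;
                while (filled < count)
                {
                    int bytes = Math.min(CHUNK_BYTES / 8, count - filled) * 8;
                    din.readFully(chunk, 0, bytes);      // one big read per batch
                    ByteBuffer buf = ByteBuffer.wrap(chunk, 0, bytes);
                    while (buf.hasRemaining())
                        words[filled++] = buf.getLong(); // decode longs in memory
                }
                return words;
            }
        }

        Much the same effect can be had by simply wrapping the source stream in a java.io.BufferedInputStream, which, as Jonathan notes at the end of this thread, loadBloomFilter already does.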

        Jonathan Ellis added a comment -

        Sorry, the description threw me off. So really we're talking about our paged bloom filter (CASSANDRA-2466), not the multithreaded startup (CASSANDRA-2988).

        Jonathan Ellis added a comment -

        patch to do buffered bloom filter deserialization

        Michael Harris added a comment -

        +1 on the patch. When we get a chance, we'll try out the patch on the cluster that demonstrated the issue.

        The broader intent of the ticket is that startup is slower overall, and this was just one example I found in the code that could be part of the problem. Any other ideas as to what could cause slower startup? We noticed the logs showed a lot of SSTables being opened during startup, and we do have some 100+ GB SSTables on disk, so general SSTable reading was my first instinct to check.

        Michael Harris added a comment -

        Also, FWIW, the observed slowness is in 1.0.7, not 1.0.1 as listed in the affects version (though it may well affect 1.0.1 as well). I don't seem to have permission to update this.

        Jonathan Ellis added a comment -

        Right, the actual versions affected are [affects-version .. fix-version).

        Michael Harris added a comment -

        Hi Jonathan. We patched our cluster and startup time didn't decrease much, if at all. I still think the patch probably helps, but it looks like there are bigger problems elsewhere. Let me know if any additional testing or logs could help out here.

        Yuki Morishita added a comment -

        I measured how long it takes to load each sstable component (Data, Index, Filter) and found that loading from the index file takes longer in 1.0 than in 0.8.
        Comparing the code between 0.8 and 1.0, I noticed that in 1.0 every key stored in the index file gets deserialized, while in 0.8 only those keys that should be added to the index summary are deserialized.
        The reason we deserialize all keys in 1.0 is to obtain the first and last keys stored in the sstable. The attached patch tries to skip deserializing keys when possible.

        Patch is against the 1.0 branch.
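
        A rough sketch of the skip-when-possible idea, assuming the 1.0-era index layout of one (short key length, key bytes, long data position) record per entry; the names are illustrative, not the actual patch.

        import java.io.DataInput;
        import java.io.IOException;

        public final class IndexKeySkipper
        {
            // Reads one index entry, deserializing the key bytes only when the
            // entry is actually needed (an index-summary sample, or the first or
            // last key); otherwise the key bytes are skipped outright.
            static long readEntry(DataInput input, long indexPosition, long indexSize,
                                  boolean shouldAddEntry) throws IOException
            {
                int len = input.readUnsignedShort();   // 2-byte key length
                boolean firstKey = indexPosition == 0;
                boolean lastKey = indexPosition + 2 + len + 8 == indexSize;
                if (shouldAddEntry || firstKey || lastKey)
                {
                    byte[] key = new byte[len];
                    input.readFully(key);              // deserialize the key
                }
                else
                {
                    // A real implementation would loop, since skipBytes may skip fewer bytes.
                    input.skipBytes(len);
                }
                return input.readLong();               // position in the Data file
            }
        }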

        Jonathan Ellis added a comment -

        LGTM, but can you add a test for indexSummary, SSTR.first, and SSTR.last? (Adding to one of the existing SSTR or CF tests is fine of course.)

        Yuki Morishita added a comment -

        Attaching v3, which is slightly modified from v2 and comes with test.

        Jonathan Ellis added a comment -

        committed, thanks!

        Yuki Morishita added a comment -

        In trunk, CASSANDRA-2319 changed the content of the primary index file, so we need to calculate its serialized size differently to determine whether a key is the last one.

        Yuki Morishita added a comment -

        Attaching a patch against current trunk.

        Unfortunately, there is no way to determine the serialized size of a promoted row index without actually deserializing some of it, so we cannot determine the last key up front.
        The patch deserializes every key when the sstable has a promoted index, but in exchange it tries to reduce the amount of IO when deserializing the index itself.
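
        A sketch of that IO reduction, under the assumption (consistent with Dave's analysis below) that a trunk index entry stores a long data position followed by an int giving the serialized size of the promoted index data; the class name is illustrative.

        import java.io.DataInput;
        import java.io.IOException;

        public final class PromotedIndexSkipper
        {
            // Assumed entry layout: [long position][int promotedSize][promotedSize bytes].
            // When only the data-file position is wanted, skip the variable-length
            // promoted index blob instead of deserializing it.
            static long deserializePositionOnly(DataInput input) throws IOException
            {
                long position = input.readLong();    // position in the Data file
                int promotedSize = input.readInt();  // 0 when there is no promoted index
                if (promotedSize > 0)
                    input.skipBytes(promotedSize);   // skip deletion info and column index
                return position;
            }
        }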

        Jonathan Ellis added a comment -

        Is this io-reducing something that could also apply to 1.0?

        Yuki Morishita added a comment -

        Is this io-reducing something that could also apply to 1.0?

        It's for RowIndexEntry, which was introduced for v1.2 (CASSANDRA-2319), so it only relates to trunk.

        Jonathan Ellis added a comment -

        Could we rename lastKey to lastKeyForUnpromoted, and then take the hasPromotedIndexes out of that assignment? Since you also added it to the big if statement.

        I guess deserializePositionOnly is only for unpromoted indexes? Because for promoted ones decoratedKey will never be null.

        +1 otherwise; I guess we'll have to rely on CASSANDRA-2392 for startup speed in 1.2.

        Dave Brosius added a comment -

        Getting an NPE around this code (trunk): SSTableReader.open on system-schema_columnfamilies-ib-2.

        SSTableReader.load bails out of the while (true) loop because

            if (indexPosition == indexSize)

        is hit, causing the right variable to remain null. That null then gets passed to

            this.last = getMinimalKey(right);

        which NPEs.

        The check

            boolean lastKey = indexPosition + DBConstants.SHORT_SIZE + len + DBConstants.LONG_SIZE == indexSize;

        is false on the last item:

            indexPosition = 0
            len = 9
            indexSize = 23

        19 != 23

        public static DecoratedKey getMinimalKey(DecoratedKey key)
        {
            return key.key.position() > 0 || key.key.hasRemaining()
                 ? new DecoratedKey(key.token, HeapAllocator.instance.clone(key.key))
                 : key;
        }
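
        Spelling the arithmetic out with the standard primitive widths (SHORT_SIZE = 2, LONG_SIZE = 8, INT_SIZE = 4), a small self-contained check; it assumes, as Dave's fix below does, that the missing 4 bytes are the promoted-index size field.

        public final class LastKeyArithmetic
        {
            // Primitive widths matching DBConstants.
            static final int SHORT_SIZE = 2, INT_SIZE = 4, LONG_SIZE = 8;

            public static void main(String[] args)
            {
                long indexPosition = 0, indexSize = 23;
                int len = 9;

                // Old check: 0 + 2 + 9 + 8 = 19 != 23, so lastKey is wrongly false.
                boolean oldLastKey =
                    indexPosition + SHORT_SIZE + len + LONG_SIZE == indexSize;

                // With a promoted index each entry also carries a 4-byte size field:
                // 0 + 2 + 9 + 8 + 4 = 23, so the last key is correctly detected.
                boolean fixedLastKey =
                    indexPosition + SHORT_SIZE + len + LONG_SIZE + INT_SIZE == indexSize;

                System.out.println(oldLastKey + " -> " + fixedLastKey); // false -> true
            }
        }
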
        Dave Brosius added a comment - edited

        Seems like this is the fix to me (at least in the simple case where the int read in deserialize is 0); more would need to be done if not.

        diff --git a/src/java/org/apache/cassandra/io/sstable/SSTableReader.java b/src/java/org/apache/cassandra/io/sstable/SSTableReader.java
        index 9e92220..959a66d 100644
        --- a/src/java/org/apache/cassandra/io/sstable/SSTableReader.java
        +++ b/src/java/org/apache/cassandra/io/sstable/SSTableReader.java
        @@ -361,7 +361,7 @@ public class SSTableReader extends SSTable
                 int len = ByteBufferUtil.readShortLength(input);

                 boolean firstKey = left == null;

        -        boolean lastKey = indexPosition + DBConstants.SHORT_SIZE + len + DBConstants.LONG_SIZE == indexSize;
        +        boolean lastKey = indexPosition + DBConstants.SHORT_SIZE + len + DBConstants.LONG_SIZE + (descriptor.hasPromotedIndexes ? DBConstants.INT_SIZE : 0) == indexSize;
                 boolean shouldAddEntry = indexSummary.shouldAddEntry();
                 if (shouldAddEntry || cacheLoading || recreatebloom || firstKey || lastKey)
                 {
        Yuki Morishita added a comment -

        Could we rename lastKey to lastKeyForUnpromoted, and then take the hasPromotedIndexes out of that assignment?

        Done.

        I guess deserializePositionOnly is only for unpromoted indexes?

        It deserializes the whole row index only when decoratedKey is not null AND we are caching keys; otherwise it only deserializes the position.

        Jonathan Ellis added a comment -

        Does v3 include a fix for the NPE Dave found?

        Yuki Morishita added a comment -

        The NPE only happens on trunk, and that is why I reopened this issue.
        The fix is attached as trunk-4023.txt.
        The v3 attached above is for the 1.0 branch and was already committed to 1.0 and 1.1 without any problem.

        Dave Brosius added a comment - edited

        No, v3 has this bug.

        The problem is that in the general case, where RowIndexEntry.serializer.deserialize() is using the hasPromotedIndexes code, the amount of data read is variable (assuming the first int read in deserialize can be non-zero). Therefore, in SSTableReader the code can no longer statically predict when a particular key will be the last key.

        Ah, I see... yes, this happens on trunk. So if it is fixed elsewhere, then good; never mind me.

        Jonathan Ellis added a comment -

        committed.

        (I note, for the record, that my original 4023.txt patch is unnecessary because we're already using a buffered input stream in loadBloomFilter.)


          People

          • Assignee: Yuki Morishita
          • Reporter: Joaquin Casares
          • Reviewer: Jonathan Ellis
          • Votes: 0
          • Watchers: 1
