HBASE-1938: Make in-memory table scanning faster

    Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Blocker
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 0.90.4, 0.92.0
    • Component/s: Performance
    • Labels:
      None
    • Hadoop Flags:
      Reviewed

      Description

      This issue is about profiling hbase to see if I can make hbase scans run faster when all is up in memory. Talking to some users, they are seeing about a quarter-million rows a second. It should be able to go faster than this (scanning an array of objects, they can do about 4-5x that).

      1. test.patch
        2 kB
        stack
      2. caching-keylength-in-kv.patch
        1 kB
        stack
      3. MemStoreScanPerformance.java
        2 kB
        stack
      4. MemStoreScanPerformance.java
        2 kB
        Nicolas Liochon
      5. 20110726_1938_KeyValueSkipListSet.patch
        2 kB
        Nicolas Liochon
      6. 20110726_1938_MemStore.patch
        6 kB
        Nicolas Liochon
      7. 20110726_1938_MemStoreScanPerformance.java
        3 kB
        Nicolas Liochon
      8. MemStoreScanPerformance.java
        3 kB
        Nicolas Liochon
      9. 20110802_MemStore.patch
        3 kB
        Nicolas Liochon

        Activity

        Nicolas Liochon added a comment -

        Yes, I believe Andy's issue was caused by:

        • the improvement was actually not used in the test case
        • maybe the point in the getLower implementation mentioned above.

        I fixed both points, so I believe it can be integrated.

        I also added the synchronized because it was strange to have all other public methods synchronized except this one, but I didn't check whether some test cases were actually calling them in parallel. I will have a look at HBASE-3855 and come back to you if I find something.

        stack added a comment -

        @nkeywal What do you need me to do here? Should I try 20110802_MemStore.patch and see if it still has the issue Andrew identified? You added synchronize of reseek. We're having consistency issues elsewhere – hbase-3855 – so do you think it could be because of the lack of sync?

        Nicolas Liochon added a comment -

        For any future modification of this part: on JDK 1.6.24, it seems that this code:

            protected KeyValue getLower(KeyValue first, KeyValue second) {
              if (first == null && second == null) {
                return null;
              }
              if (first != null && second != null) {
                int compare = comparator.compare(first, second);
                return (compare <= 0 ? first : second);
              }
              return (first != null ? first : second);
            }

        performs better than this one:

            protected KeyValue getLower(KeyValue first, KeyValue second) {
              if (first == null) {
                return second ;
              }
              if (second == null) {
                return first ;
              }
        
              int compare = comparator.compare(first, second);
              return (compare <= 0 ? first : second);
            }

        There is a lot of variance in the results, but the average favors the first one.
        The first one is the current implementation, so I kept it (this is a difference from the previous patch). The second one is actually used in the same file for a similar problem, in MemStore#getLowest; I kept that one as well.
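
        For anyone wanting to reproduce the comparison, a rough, hypothetical micro-benchmark sketch (not the attached MemStoreScanPerformance test; Integers stand in for KeyValues and the timing uses plain System.nanoTime(), so expect JIT and GC noise):

            import java.util.Comparator;
            import java.util.Random;

            // Rough, hypothetical micro-benchmark comparing the two null-handling
            // shapes of getLower(). Plain System.nanoTime() timing, so results are
            // noisy; a proper harness would be needed for firm conclusions.
            public class GetLowerBench {
              static final Comparator<Integer> CMP = Integer::compare;

              static Integer getLowerA(Integer first, Integer second) {
                if (first == null && second == null) {
                  return null;
                }
                if (first != null && second != null) {
                  return CMP.compare(first, second) <= 0 ? first : second;
                }
                return first != null ? first : second;
              }

              static Integer getLowerB(Integer first, Integer second) {
                if (first == null) {
                  return second;
                }
                if (second == null) {
                  return first;
                }
                return CMP.compare(first, second) <= 0 ? first : second;
              }

              public static void main(String[] args) {
                Random r = new Random(42);
                int n = 2_000_000;
                Integer[] xs = new Integer[n], ys = new Integer[n];
                for (int i = 0; i < n; i++) {
                  xs[i] = r.nextInt(10) == 0 ? null : r.nextInt();
                  ys[i] = r.nextInt(10) == 0 ? null : r.nextInt();
                }
                for (int round = 0; round < 5; round++) {  // repeat so the JIT warms up
                  long sumA = 0, sumB = 0;
                  long t0 = System.nanoTime();
                  for (int i = 0; i < n; i++) {
                    Integer v = getLowerA(xs[i], ys[i]);
                    if (v != null) sumA += v;
                  }
                  long tA = System.nanoTime() - t0;
                  t0 = System.nanoTime();
                  for (int i = 0; i < n; i++) {
                    Integer v = getLowerB(xs[i], ys[i]);
                    if (v != null) sumB += v;
                  }
                  long tB = System.nanoTime() - t0;
                  System.out.printf("round %d: A=%d ms, B=%d ms (checksums %d/%d)%n",
                      round, tA / 1_000_000, tB / 1_000_000, sumA, sumB);
                }
              }
            }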

        Nicolas Liochon added a comment -

        Test added to show the impact when the snapshot is not empty.

        Patch on MemStore.java to optimize the "peek()" function by precalculating the next value. It improves performance by 25% on the test above (i.e. when the two lists are not empty), by saving a call to the comparator.

        "reseek" is now synchronized, plus minor modifications (@Override added, private added, ...).

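        As a minimal, hypothetical sketch of the "precalculate the next value" idea (generic Java, not the actual MemStoreScanner code): a scanner merging two sorted iterators that caches the head of each source plus the current lowest, so peek() costs nothing and next() does a single comparison after advancing one source.

            import java.util.Comparator;
            import java.util.Iterator;

            // Hypothetical two-way merging scanner: caches the head of each source
            // ("aNext"/"bNext") plus the precomputed lowest ("theNext"), so peek()
            // is free and next() recomputes the lowest only once after advancing.
            public class MergingScanner<T> {
              private final Iterator<T> a, b;
              private final Comparator<T> cmp;
              private T aNext, bNext, theNext;

              public MergingScanner(Iterator<T> a, Iterator<T> b, Comparator<T> cmp) {
                this.a = a;
                this.b = b;
                this.cmp = cmp;
                this.aNext = a.hasNext() ? a.next() : null;
                this.bNext = b.hasNext() ? b.next() : null;
                this.theNext = lower(aNext, bNext);
              }

              /** Returns the next element without consuming it; nothing to compute. */
              public T peek() {
                return theNext;
              }

              /** Returns the next element and advances the source it came from. */
              public T next() {
                T ret = theNext;
                if (ret == null) {
                  return null;
                }
                if (ret == aNext) {
                  aNext = a.hasNext() ? a.next() : null;
                } else {
                  bNext = b.hasNext() ? b.next() : null;
                }
                theNext = lower(aNext, bNext);  // single comparison per call
                return ret;
              }

              private T lower(T x, T y) {
                if (x == null) return y;
                if (y == null) return x;
                return cmp.compare(x, y) <= 0 ? x : y;
              }
            }
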
        stack added a comment -

        Sounds great nkeywal. Thanks for digging in here.

        Nicolas Liochon added a comment -

        Ok, I understand what's going on.

        The enhancement on readPoint aims at calling a TLS once instead of twice per call to next().

        This should work well when the kvset and snapshot lists are not empty. However, in the unit test, the snapshot list is empty, so we were already calling the TLS only once before.

        I will write a second test to highlight the difference.

        That said, as Andrew and Todd think that a modification of readPoint could change the consistency behavior, I don't think it's worth doing.

        So if you agree, I will:

        • write a test with a non-empty snapshot
        • provide a patch on MemStore with all the changes except the readPoint one

        With the iterator change, that makes a big boost on the scan perf already.

        Andrew Purtell added a comment -

        I see the same type of results with either Sun Java 1.6.0_26 or OpenJDK 1.6.0_22.

        Nicolas Liochon added a comment -

        Thank you. It's interesting because the tests with 150000 entries are very consistent: always ~49 ms without the patch and ~51 ms with the patch. Could be HotSpot related. Or not. I will have a look at this.

        Andrew Purtell added a comment -

        This is a representative test with 0.90 branch.

        With 20110726_1938_MemStore.patch:

        JUnit version 4.8.1
        .Loaded in 1158 ms
        Scan with size 50000: 61 ms
        Scan with size 50000: 91 ms
        Scan with size 50000: 35 ms
        Scan with size 50000: 23 ms
        Scan with size 50000: 24 ms
        Scan with size 50000: 22 ms
        Scan with size 50000: 22 ms
        Scan with size 50000: 23 ms
        Scan with size 50000: 22 ms
        Scan with size 50000: 23 ms
        Loaded in 1709 ms
        Scan with size 75000: 31 ms
        Scan with size 75000: 30 ms
        Scan with size 75000: 30 ms
        Scan with size 75000: 29 ms
        Scan with size 75000: 30 ms
        Scan with size 75000: 30 ms
        Scan with size 75000: 30 ms
        Scan with size 75000: 30 ms
        Scan with size 75000: 29 ms
        Scan with size 75000: 29 ms
        Loaded in 2099 ms
        Scan with size 100000: 36 ms
        Scan with size 100000: 38 ms
        Scan with size 100000: 38 ms
        Scan with size 100000: 35 ms
        Scan with size 100000: 35 ms
        Scan with size 100000: 34 ms
        Scan with size 100000: 35 ms
        Scan with size 100000: 34 ms
        Scan with size 100000: 34 ms
        Scan with size 100000: 38 ms
        Loaded in 2813 ms
        Scan with size 125000: 45 ms
        Scan with size 125000: 44 ms
        Scan with size 125000: 53 ms
        Scan with size 125000: 44 ms
        Scan with size 125000: 44 ms
        Scan with size 125000: 44 ms
        Scan with size 125000: 44 ms
        Scan with size 125000: 44 ms
        Scan with size 125000: 44 ms
        Scan with size 125000: 45 ms
        Loaded in 3456 ms
        Scan with size 150000: 54 ms
        Scan with size 150000: 51 ms
        Scan with size 150000: 50 ms
        Scan with size 150000: 50 ms
        Scan with size 150000: 50 ms
        Scan with size 150000: 52 ms
        Scan with size 150000: 50 ms
        Scan with size 150000: 50 ms
        Scan with size 150000: 51 ms
        Scan with size 150000: 50 ms
        Total load time: 11235 ms (i.e:11 seconds)
        Total scan time: 1959 ms (i.e:1 seconds)
        Rows scanned per seconds: 2552322
        Rows loaded per seconds: 445037
        
        Time: 13.515
        
        OK (1 test)
        

        Without 20110726_1938_MemStore.patch:

        JUnit version 4.8.1
        .Loaded in 1197 ms
        Scan with size 50000: 51 ms
        Scan with size 50000: 119 ms
        Scan with size 50000: 21 ms
        Scan with size 50000: 22 ms
        Scan with size 50000: 22 ms
        Scan with size 50000: 22 ms
        Scan with size 50000: 21 ms
        Scan with size 50000: 22 ms
        Scan with size 50000: 22 ms
        Scan with size 50000: 22 ms
        Loaded in 1496 ms
        Scan with size 75000: 31 ms
        Scan with size 75000: 32 ms
        Scan with size 75000: 29 ms
        Scan with size 75000: 29 ms
        Scan with size 75000: 30 ms
        Scan with size 75000: 30 ms
        Scan with size 75000: 37 ms
        Scan with size 75000: 30 ms
        Scan with size 75000: 30 ms
        Scan with size 75000: 30 ms
        Loaded in 2160 ms
        Scan with size 100000: 33 ms
        Scan with size 100000: 32 ms
        Scan with size 100000: 33 ms
        Scan with size 100000: 32 ms
        Scan with size 100000: 32 ms
        Scan with size 100000: 33 ms
        Scan with size 100000: 32 ms
        Scan with size 100000: 33 ms
        Scan with size 100000: 32 ms
        Scan with size 100000: 34 ms
        Loaded in 2720 ms
        Scan with size 125000: 42 ms
        Scan with size 125000: 51 ms
        Scan with size 125000: 41 ms
        Scan with size 125000: 42 ms
        Scan with size 125000: 41 ms
        Scan with size 125000: 42 ms
        Scan with size 125000: 42 ms
        Scan with size 125000: 41 ms
        Scan with size 125000: 41 ms
        Scan with size 125000: 43 ms
        Loaded in 3518 ms
        Scan with size 150000: 49 ms
        Scan with size 150000: 49 ms
        Scan with size 150000: 50 ms
        Scan with size 150000: 49 ms
        Scan with size 150000: 50 ms
        Scan with size 150000: 49 ms
        Scan with size 150000: 49 ms
        Scan with size 150000: 50 ms
        Scan with size 150000: 49 ms
        Scan with size 150000: 49 ms
        Total load time: 11091 ms (i.e:11 seconds)
        Total scan time: 1895 ms (i.e:1 seconds)
        Rows scanned per seconds: 2638522
        Rows loaded per seconds: 450815
        
        Time: 13.375
        
        OK (1 test)
        
        Nicolas Liochon added a comment -

        Yes, I can provide a patch with all the changes except the one on readPoint (I will test it first). Andrew, when you're saying there is a loss on your test, are you running all the unit tests, only the ones on region server, only the one on scan performance, or another set?

        stack added a comment -

        Reverted 20110726_1938_MemStore.patch for now on TRUNK (left iterator in place)

        stack added a comment -

        Let me back out the application of the MemStore patch till we figure out what's up.

        Andrew Purtell added a comment -

        I ran 3 consecutive tests with the MemStore changes applied, or not. Then, after seeing results I did not expect – longer run times for both loading and scanning with the MemStore changes applied, and higher per-op latencies (e.g. 38 ms vs 44 ms) – I ran another set of 5 runs. The results of all but one were consistent. This was not a rigorous test, I suppose; I didn't try it on another server, because I was already skeptical of the wisdom of changing how the read point is done on branch.

        Nicolas Liochon added a comment -

        Ok, that's strange. This patch includes multiple small modifications; one of them must have a side effect on something. Is there any test where the loss is more visible?

        What would the scenario be for a change in consistency?


        Andrew Purtell added a comment -

        Applied 20110726_1938_KeyValueSkipListSet and 20110726_1938_MemStoreScanPerformance to 0.90 branch.

        Andrew Purtell added a comment -

        the bits that change the read-point seem a little scary for branch since they can affect subtle consistency bugs

        I agree and my testing seems to show a small performance loss in most runs with the MemStore changes applied. So only the iterator change and the benchmark will go in.

        Todd Lipcon added a comment -

        The iterator one seems like a no-brainer for branch... the bits that change the read-point seem a little scary for branch since they can affect subtle consistency bugs

        Andrew Purtell added a comment -

        20110726_1938* apply cleanly to 0.90 as well. I've applied them and am investigating. Will commit to branch if the results test clean and there seems to be a benefit.

        Hudson added a comment -

        Integrated in HBase-TRUNK #2056 (See https://builds.apache.org/job/HBase-TRUNK/2056/)
        HBASE-1938 Make in-memory table scanning faster

        stack :
        Files :

        • /hbase/trunk/CHANGES.txt
        • /hbase/trunk/src/main/java/org/apache/hadoop/hbase/regionserver/MemStore.java
        • /hbase/trunk/src/main/java/org/apache/hadoop/hbase/regionserver/KeyValueSkipListSet.java
        stack added a comment -

        Committed to TRUNK. Nice one nkeywal (I closed this issue. Let's open new ones if you find new stuff).

        Nicolas Liochon added a comment -

        Any chance of our calling a 'next' without doing a 'seek' first (or a reseek)? Am worried we'd trip on a null theNext.

        I compared with the previous version; the new implementation behaves like the previous one and returns a nice "null".

        stack added a comment -

        Assigning nkeywal. Marking patch available against 0.92.

        stack added a comment -

        +1 on 20110726_1938_KeyValueSkipListSet.patch. It looks great. Will commit it when I commit 20110726_1938_MemStore.patch, so I can then close this issue. Nice work nkeywal.

        stack added a comment -

        On 20110726_1938_MemStore.patch:

        FYI, in future, just remove code rather than comment it out: i.e. + //long readPoint = ReadWriteConsistencyControl.getThreadReadPoint();

        The 'thenext' data member looks fine to me. Worst that could happen is that we lag the read point slightly though unlikely ('theNext' ain't best name but I see you are just taking the old local variable name so not your malnaming... no worries).

        Any chance of our calling a 'next' without doing a 'seek' first (or a reseek)? Am worried we'd trip on a null theNext.

        If seek has a synchronized, yeah, reseek should too – good one.

        I'm good with committing this. We have a bunch of tests that should vomit if stuff comes back out of order or not what we expect (If this does prove to break things, then we are lacking key coverage and lets address it then).

        I'll let it hang out a day. Someone else might have an opinion in here.

        Nicolas Liochon added a comment -

        20110726_1938_KeyValueSkipListSet.patch: use the native ValueIterator instead of a wrapper on EntryIterator, suppressing an object creation for each call to the iterator.
        20110726_1938_MemStoreScanPerformance.java: simple test case to measure scan performance.
        20110726_1938_MemStore.patch: multiple small performance improvements on MemStoreScanner.

        Nicolas Liochon added a comment -

        Thanks Stack! It could be considered a JDK bug, as it makes the EntryIterator useless when you manipulate large lists. From an HBase point of view there is no need for an EntryIterator anyway, so it's simpler.

        Here is the patch.

        • 20110726_1938_KeyValueSkipListSet.patch: use the native ValueIterator instead of a wrapper on EntryIterator, suppressing an object creation for each call to the iterator.
        • 20110726_1938_MemStoreScanPerformance.java: simple test case to measure scan performance.
        • 20110726_1938_MemStore.patch: multiple small performance improvements on MemStoreScanner.

        It's obviously worth a review, especially the last one... Unit tests run fine.

        stack added a comment -

        @nkeywal I'm not as lazy this morning as I was yesterday so took a look at the java src. Indeed that looks like a nice optimization. Great stuff.

        stack added a comment -

        @nkeywal nice one!

        Nicolas Liochon added a comment -

        I have an improvement that could make a real difference.

        In HBase, there is an iterator called MapEntryIterator that in reality acts as a ValueIterator:

        static class MapEntryIterator implements Iterator<KeyValue> {
            private final Iterator<Map.Entry<KeyValue, KeyValue>> iterator;

            public KeyValue next() {
              return this.iterator.next().getValue();
            }
            // ... (rest of the class elided)
        }
        

        However, with the current implementation of the JDK, there is an important difference between an iterator on values and an iterator on entries. From java.util.concurrent we can see:

        The ValueIterator is straightforward:

        final class ValueIterator extends Iter<V> {
                public V next() {
                    V v = nextValue;
                    advance();
                    return v;
                }
            }

        There is some defensive programming taking place for the EntryIterator, with the creation of an immutable object:

        final class EntryIterator extends Iter<Map.Entry<K,V>> {
                public Map.Entry<K,V> next() {
                    Node<K,V> n = next;
                    V v = nextValue;
                    advance();
                    return new AbstractMap.SimpleImmutableEntry<K,V>(n.key, v);
                }
            }

        As a consequence, there is at least one object creation for every row returned by the hbase scanner. This creation is actually useless, as we throw away the object immediately, so several GCs occur during the test. I modified the MapEntryIterator implementation to iterate on the values.

        static class MapEntryIterator implements Iterator<KeyValue> {
            private final Iterator<KeyValue> iterator;

            public KeyValue next() {
              return this.iterator.next();
            }
            // ... (rest of the class elided)
        }

        The scan time is divided by 3 in the test. The exact improvement ratio is somewhat arbitrary as it's driven by the GC execution, but it should be valuable in production as well.

        I am currently running the unit tests; I will add the patch if the execution is ok.
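
        For illustration, a self-contained, hypothetical demo (not the HBase code) of the difference: walking the same ConcurrentSkipListMap through entrySet().iterator(), which builds an immutable Map.Entry per next(), versus values().iterator(), which hands back the stored value directly.

            import java.util.Iterator;
            import java.util.Map;
            import java.util.concurrent.ConcurrentSkipListMap;

            // Hypothetical stand-alone demo: iterate the same skip list via entries
            // and via values. The values() iterator avoids the per-element
            // Map.Entry allocation performed by the entrySet() iterator.
            public class IteratorAllocationDemo {
              public static void main(String[] args) {
                ConcurrentSkipListMap<String, String> map = new ConcurrentSkipListMap<>();
                for (int i = 0; i < 1_000_000; i++) {
                  String k = String.format("row-%07d", i);
                  map.put(k, k);  // key == value, as in a KeyValue skip list set
                }

                // Entry iterator: each next() builds a new immutable Map.Entry.
                long t0 = System.nanoTime();
                long entries = 0;
                Iterator<Map.Entry<String, String>> eit = map.entrySet().iterator();
                while (eit.hasNext()) {
                  if (eit.next().getValue() != null) entries++;  // allocation on next()
                }
                long entryMs = (System.nanoTime() - t0) / 1_000_000;

                // Value iterator: next() hands back the stored value directly.
                t0 = System.nanoTime();
                long values = 0;
                Iterator<String> vit = map.values().iterator();
                while (vit.hasNext()) {
                  if (vit.next() != null) values++;
                }
                long valueMs = (System.nanoTime() - t0) / 1_000_000;

                System.out.println("entrySet(): " + entries + " in " + entryMs + " ms");
                System.out.println("values():   " + values + " in " + valueMs + " ms");
              }
            }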

        Nicolas Liochon added a comment -

        I will write the patch, that will be simpler.

        What is not yet clear to me is what we can expect when there are multiple threads using the same scanner, as the "readPoint" is a TLS. However, the patch will not change the current behavior.

        Ted Yu added a comment -

        1) Replacement of KeyValue lowest = getLowest();

        It is in the "seek" function

        Another option is to share getThreadReadPoint() value for the two iterators

        N was talking about the following code in MemStore.next():

              if (theNext == kvsetNextRow) {
                kvsetNextRow = getNext(kvsetIt);
              } else {
                snapshotNextRow = getNext(snapshotIt);
              }
        

        The intent was to save the system call.

        stack added a comment -

        To me, it makes sense to precalculate the value returned by "peek", and reuse it in next().

        If there is no chance of the value changing between the peek and next, it sounds good (I've not looked at this code in a while).

        It would be great to save this system call in a next().

        Yes (I like how you figure there's a system call doing thread local get).

        In fact, as this value seems to be a TLS, I don't see how it could change during the execution of next(). What do you think?

        (I'm being lazy. I've not looked at the code). The updates to RWCC happen at well-defined points so should be easy enough to elicit if there is a problem w/ your presumption above.

        Last question on this: what is the use case when the getThreadReadPoint() will change during the whole scan (i.e.: between next)?

        IIRC, we want to let the scan see the most up-to-date view on a row though our guarantees are less than this (See http://hbase.apache.org/acid-semantics.html).

        Most of the public methods (except reseek) are "synchronized", it implies that the scanner can be shared between threads?

        That seems like a valid deduction to make.

        1) Replacement of KeyValue lowest = getLowest();

        You mean in MemStore#reseek? What would you put in its place (Sorry if I'm not following the bouncing ball properly).

        ...don't get the data getThreadReadPoint()

        So, we'd just hold to the current read point for how long? The full scan? That might be possible given our lax guarantees above, though it would be nice not to have to give up on up-to-the-millisecond views on rows.

        Another option is to share getThreadReadPoint() value for the two iterators, i.e. read the value in the next() function, and give it as a parameter to getNext()

        What are the 'two iterators' here?

        Sorry N, I don't have my head as deep in this stuff as you do currently so my questions and answers above may be off. Please compensate appropriately.

        Nicolas Liochon added a comment -

        Hello Stack,

        accesses and perhaps to make it go faster.

        I will have a look at it, I see as well in this test and in the global profiling that a lot of time is spent on it.

        scanner.next();

        There are two iterators in the class (kvsetIt and snapshotIt), and getLowest compares the two to return the lowest. However, in this test, one of the lists is empty, so the value is null, and hence the real comparison on byte[] is not executed.

        On this subject, there is a possible optimisation of the function "peek", which repeats the comparison: if peek is called multiple times, or if we often have peek() then next(), we can save the redundant comparisons. To me, it makes sense to precalculate the value returned by "peek", and reuse it in next().

        The profiling (method: sampling, Java inlining deactivated) says something interesting:

        Name; total time spent
        org.apache.hadoop.hbase.regionserver.MemStore$MemStoreScanner.next() 100%
        org.apache.hadoop.hbase.regionserver.MemStore$MemStoreScanner.getNext(Iterator) 88%
        org.apache.hadoop.hbase.regionserver.KeyValueSkipListSet$MapEntryIterator.next() 44%
        java.util.concurrent.ConcurrentSkipListMap$SubMap$SubMapEntryIterator.next() 36%
        org.apache.hadoop.hbase.regionserver.ReadWriteConsistencyControl.getThreadReadPoint() 26%
        java.lang.ThreadLocal.get() 21%
        org.apache.hadoop.hbase.regionserver.KeyValueSkipListSet$MapEntryIterator.hasNext() 8%
        org.apache.hadoop.hbase.regionserver.MemStore$MemStoreScanner.getLowest() 7%
        java.util.concurrent.ConcurrentSkipListMap$SubMap$SubMapIter.hasNext() 3%
        org.apache.hadoop.hbase.regionserver.MemStore$MemStoreScanner.getLower(KeyValue, KeyValue) 3%
        java.lang.Long.longValue() 2%

        So we're spending 26% of the time on this: org.apache.hadoop.hbase.regionserver.ReadWriteConsistencyControl.getThreadReadPoint().

        And in this getThreadReadPoint(), the actual time is spent in java.lang.ThreadLocal.get() (21%).

        It's a TLS, so we can expect a system call to get the thread id. It would be great to save this system call in a next().

        There is at least an improvement for the case when one of the lists is done: don't get the data getThreadReadPoint(). That would not change the behaviour at all, but would already be interesting (maybe 10% in this test). Another option is to share getThreadReadPoint() value for the two iterators, i.e. read the value in the next() function, and give it as a parameter to getNext(). In fact, as this value seems to be a TLS, I don't see how it could change during the execution of next(). What do you think? Last question on this: what is the use case when the getThreadReadPoint() will change during the whole scan (i.e.: between next)?

        Most of the public methods (except reseek) are "synchronized", it implies that the scanner can be shared between threads?

        At the end, it seems that there are 3 possible things to do:
        1) Replacement of KeyValue lowest = getLowest();
        2) theNext precalculation for peek() and next()
        3) Depending on your feedback, one of the options above on getThreadReadPoint().

        This should give a 5 to 15% increase in performance, not "problem solved" stuff, but it could justify a first patch. I can do it (with the hbase indenting).
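
        As a minimal, hypothetical sketch of the "read the TLS once per next()" option (not the RWCC or MemStore code; class and field names are made up): read the thread-local read point into a local once and pass it down, instead of calling ThreadLocal.get() for every candidate.

            import java.util.Iterator;

            // Hypothetical sketch: read the thread-local read point once per call
            // and pass it down, instead of hitting ThreadLocal.get() per candidate.
            public class ReadPointScan {
              static final ThreadLocal<Long> READ_POINT =
                  ThreadLocal.withInitial(() -> Long.MAX_VALUE);

              static final class Cell {
                final String row;
                final long memstoreTs;
                Cell(String row, long memstoreTs) {
                  this.row = row;
                  this.memstoreTs = memstoreTs;
                }
              }

              // Before: one ThreadLocal.get() for every candidate cell.
              static Cell getNextSlow(Iterator<Cell> it) {
                while (it.hasNext()) {
                  Cell c = it.next();
                  if (c.memstoreTs <= READ_POINT.get()) {  // TLS read every iteration
                    return c;
                  }
                }
                return null;
              }

              // After: the caller reads the read point once and passes it in.
              static Cell getNext(Iterator<Cell> it, long readPoint) {
                while (it.hasNext()) {
                  Cell c = it.next();
                  if (c.memstoreTs <= readPoint) {         // plain local comparison
                    return c;
                  }
                }
                return null;
              }

              // One TLS read per advance, reusable by however many sources need it.
              static Cell advance(Iterator<Cell> source) {
                long readPoint = READ_POINT.get();
                return getNext(source, readPoint);
              }
            }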


        vamshi added a comment -

        Hi stack, how can we perform lookup/scanning in HBase? Can we use Distributed Hashing (DHT) for that? I want to implement a scalable data structure, i.e. a DHT, in HBase; how can I proceed? Please help me. Thank you.

        stack added a comment -

        I modified the unit test to make it work with the trunk as it is today (new file attached).

        Thanks.

        Reviewing it, one thing you might want to do is study classes in hbase to get the gist of the hadoop/hbase style. Notice how they have two spaces for tabs, ~80 chars a line. But that's for the future. Not important here.

        You just need to make sure your KVs have a readPoint that is less than the current readPoint. It looks like you are making KVs w/o setting memstorets. The default is then used and it's zero. The default read point is zero. The compare is <= so it looks like you don't need to set the read point at all. What you have should be no harm.

        Your new test class seems fine. Would be nice to add more tests. As the memstore data structure grows, all slows.

        Another issue is about hacking on the concurrentskiplistset that is memstore to make it more suited to our accesses and perhaps to make it go faster (it's public domain when you dig down into the java src).

        On the scan "next()" part, hbase currently compares the values of two internal iterators. In this test, the second list is always empty, hence the cost of the comparator is lowered vs. real life.

        What is this that you are referring to? Is it this? KeyValue kv = scanner.next();

        But I don't think it's worth a patch just for this (it should be included in a bigger patch however).

        Up to you but yes, the above is probably the way to go.

        Thanks N.

        Nicolas Liochon added a comment -

        I modified the unit test to make it work with the trunk as it is today (new file attached). It's worth reviewing: I set a magic value for setThreadReadPoint, and I don't know if it is the right thing to do.

        I also added a loop on the list size to make visible any exponential cost.

        On the scan "next()" part, hbase currently compares the values of two internal iterators. In this test, the second list is always empty, hence the cost of the comparator is lowered vs. real life. I don't know if it is a side effect of my modifications.

        There is a trivial optimization in the "seek" function.
        KeyValue lowest = getLowest();
        return lowest != null;

        Could be replaced by:
        return (kvsetNextRow != null || snapshotNextRow != null);

        But I don't think it's worth a patch just for this (it should be included in a bigger patch however). If you think differently, I can do it.

        stack added a comment -

        Moved out of 0.92 (It looks like Andrew Purtell and Dhruba have revived interest in this issue – let them bring it back in if its same thing).

        stack added a comment -

        Moved from 0.21 to 0.22 just after merge of old 0.20 branch into TRUNK.

        stack added a comment -

        My little perpetual scanning test.

        stack added a comment -

        Small change to KV. Uses 8 bytes more memory to cache Key Length.

        stack added a comment -

        Profiling in-memory scanning of MemStore:

        + All time is in cacheNextRow as you'd expect. 41% of CPU is doing SortedSet#first, 24% is making an Iterator, and 22% is doing the isEmpty test (which calls the #first method). Each of these methods ends up in KVComparator.compare. All our scanning time is doing compares. A lot of time is spent making up ints and longs out of bytes; e.g. getKeyLength and getRowLength. I can see that some of these constructions – e.g. getKeyLength – happen multiple times in a single scan for a single KV (imagine if multiple concurrent scans). This would seem to argue that we cache the construction of lengths, but there'd be an associated memory cost... maybe do it for just a few of these lengths? Key length? For example, calculating keylength once on construction would seem to make scanning near 30% faster in a simple test.

        Without caching of KeyLength:

        Loaded
        Scan: 2406
        Scan: 1685
        Scan: 1656
        Scan: 1655
        Scan: 1646
        Scan: 1647
        Scan: 1646

        With caching of KeyLength:
        Loaded
        Scan: 1970
        Scan: 1282
        Scan: 1292
        Scan: 1252
        Scan: 1273
        Scan: 1272
        Scan: 1284
        Scan: 1220
        ..

        Let me attach patches that have amended test and the change I made to KV.

        + The "reputable lads" mentioned above think our getting a tailSet for each cache of row content is wasteful, that we should be able to do better – say take out an iterator once and keep it for the life of the Scanner. On snapshot, we'd have to poke all outstanding Scanners to readjust themselves. Looking at the numbers, though, actually taking a tailset is surprisingly inexpensive; the tests for isEmpty and the creation of an Iterator each time are the bulk of the CPU. Let me play with changing the MemStoreScanner implementation to be just a set.
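
        A minimal, hypothetical sketch of the key-length-caching idea (not the attached caching-keylength-in-kv.patch; the 4-byte big-endian layout below is an assumption for the sketch only): decode the length from the backing byte[] once in the constructor and reuse it, trading a few bytes per object for the repeated decoding.

            // Hypothetical sketch of caching a decoded length instead of re-decoding
            // it from the backing byte[] on every compare. The layout (first 4 bytes
            // hold the key length, big-endian) is assumed for the sketch only.
            public class CachedLengthCell {
              private final byte[] bytes;
              private final int offset;
              private final int keyLength;  // decoded once; costs one int per instance

              public CachedLengthCell(byte[] bytes, int offset) {
                this.bytes = bytes;
                this.offset = offset;
                this.keyLength = readInt(bytes, offset);  // previously re-decoded per call
              }

              public int getKeyLength() {
                return keyLength;  // no byte-twiddling on the hot path
              }

              public byte[] getBuffer() {
                return bytes;
              }

              public int getOffset() {
                return offset;
              }

              private static int readInt(byte[] b, int off) {
                return ((b[off] & 0xff) << 24)
                    | ((b[off + 1] & 0xff) << 16)
                    | ((b[off + 2] & 0xff) << 8)
                    | (b[off + 3] & 0xff);
              }
            }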

        stack added a comment -

        A little test that loads up a MemStore then does a Scan. For profiling.

        stack added a comment -

        Some reputable lads have been looking at in-memory scanning and have turned up some interesting observations:

        + Getting values out of the block cache is way faster than getting them out of MemStore – even though the values are prefabricated as KVs in MemStore and have to be instantiated when pulling from the block cache.
        + Getting values out of the block cache is way faster than getting values out of preloaded custom-MemStore that sits in front of our current MemStore – with its handling of snapshot.
        + Getting values out of the block cache is faster than getting values from a plain Set of KVs (!!!)

        "I've noticed that the MemStoreScanner implementation is very inefficient: it basically does a search for each row since it calls tailSet. IMO this should be changed - memstore snapshots should be handles as a special case and not slow down all scans."

        Getting rid of the tailsetting – keeping the tailset for life of scan – improved MemStore performance by 40%.
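
        For illustration, a hypothetical before/after sketch of that change, simplified to a ConcurrentSkipListSet of strings rather than the real MemStoreScanner: re-running tailSet() for every row does a fresh skip-list search, while keeping one iterator for the life of the scan does the search once at seek time and then just advances.

            import java.util.Iterator;
            import java.util.concurrent.ConcurrentSkipListSet;

            // Hypothetical sketch: scan a sorted set either by taking a fresh
            // tailSet() per row (a skip-list search every time) or by holding one
            // iterator for the life of the scan (one search at seek, then advances).
            public class TailSetVsIterator {
              private final ConcurrentSkipListSet<String> set;
              private Iterator<String> scanIt;  // kept for the life of the scan

              public TailSetVsIterator(ConcurrentSkipListSet<String> set) {
                this.set = set;
              }

              // Per-row tailSet: O(log n) search for every row returned.
              public String nextViaTailSet(String current) {
                Iterator<String> it = set.tailSet(current, false).iterator();
                return it.hasNext() ? it.next() : null;
              }

              // One search at seek time, then plain iterator steps.
              public void seek(String startRow) {
                scanIt = set.tailSet(startRow, true).iterator();
              }

              public String next() {
                return (scanIt != null && scanIt.hasNext()) ? scanIt.next() : null;
              }
            }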

        Other observations:

        + We copy the row out of the KV to do compares more than once during processing of a Scan (add a counter to see for sure). Maybe we can avoid the above by using a Comparator that can do the compare in place?
        + Looking for equality, it might be faster comparing from end to start rather than from start to end, since more often it's in the tail that the difference is:

           public static boolean equals(final byte [] left, final byte [] right) {
             // Could use Arrays.equals?
        -    return left == null && right == null? true:
        -      (left == null || right == null || (left.length != right.length))? false:
        -        compareTo(left, right) == 0;
        +    if( left == null && right == null? true:
        +      (left == null || right == null || (left.length != right.length))? false: true) {
        +    // compre from end to start since we usually compare 'close' bytes
        +    int last = left.length-1;
        +    for (int i=last; i>=0; i--){
        +      if (left[i] != right[i]) {
        +        return false;
        +      }
        +    }
        +      return true;
        +    }
        +    return false;
           }
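
        A cleaned-up, hypothetical version of the end-to-start check sketched in the diff above (not the committed Bytes.equals); the semantics match a normal array equality, only the scan direction differs:

            // Hypothetical cleaned-up version of the end-to-start check from the
            // diff above; semantics match a normal array equality, only the scan
            // direction differs.
            public static boolean equalsReversed(final byte[] left, final byte[] right) {
              if (left == right) {
                return true;  // same array, or both null
              }
              if (left == null || right == null || left.length != right.length) {
                return false;
              }
              // Compare from the end: keys often share a long common prefix and
              // differ only in their trailing bytes.
              for (int i = left.length - 1; i >= 0; i--) {
                if (left[i] != right[i]) {
                  return false;
                }
              }
              return true;
            }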
        

          People

          • Assignee: Nicolas Liochon
          • Reporter: stack
          • Votes: 0
          • Watchers: 15
