HBASE-5898

Consider double-checked locking for block cache lock

    Details

    • Type: Improvement
    • Status: Closed
    • Priority: Critical
    • Resolution: Fixed
    • Affects Version/s: 0.94.1
    • Fix Version/s: 0.94.3, 0.95.0
    • Component/s: Performance
    • Labels: None

      Description

      Running a workload with a high query rate against a dataset that fits in cache, I saw a lot of CPU being used in IdLock.getLockEntry, being called by HFileReaderV2.readBlock. Even though it was all cache hits, it was wasting a lot of CPU doing lock management here. I wrote a quick patch to switch to double-checked locking and it improved throughput substantially for this workload.

      1. 5898-0.94.txt
        30 kB
        Lars Hofhansl
      2. 5898-v4.txt
        30 kB
        Lars Hofhansl
      3. 5898-v4.txt
        30 kB
        Lars Hofhansl
      4. 5898-v3.txt
        8 kB
        Lars Hofhansl
      5. 5898-v2.txt
        6 kB
        Lars Hofhansl
      6. HBASE-5898-1.patch
        6 kB
        stack
      7. HBASE-5898-1.patch
        6 kB
        Elliott Clark
      8. HBASE-5898-0.patch
        6 kB
        Elliott Clark
      9. 5898-TestBlocksRead.txt
        3 kB
        Ted Yu
      10. hbase-5898.txt
        2 kB
        Todd Lipcon

        Issue Links

          Activity

          Todd Lipcon created issue -
          Todd Lipcon added a comment -

          Here's a patch which implements the double checked locking. The first attempt to look up the value does so without locking it. If it fails, then it tries to acquire the lock, and checks again. If the re-check fails, then it actually goes to disk. Assuming that going to disk is slow, the extra check of cache should be inconsequential, but this made a big difference in throughput of a read-only in-cache workload.
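          The scheme described above can be sketched as follows. This is a minimal, self-contained approximation, not the actual patch: the class, the plain maps, and loadFromDisk are hypothetical stand-ins for HBase's block cache, per-key IdLock, and HFile read path.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical stand-in for a readBlock path using double-checked locking.
class DoubleCheckedBlockReader {
    private final Map<Long, String> cache = new ConcurrentHashMap<>();
    private final Map<Long, Object> locks = new ConcurrentHashMap<>();
    int diskReads = 0; // counts how often we actually "go to disk"

    String readBlock(long offset) {
        // First check: no lock taken. On a cache hit we return immediately.
        String block = cache.get(offset);
        if (block != null) {
            return block;
        }
        // Miss: take the per-offset lock (IdLock.getLockEntry in HBase) ...
        Object lock = locks.computeIfAbsent(offset, k -> new Object());
        synchronized (lock) {
            // ... and check again: another thread may have loaded it meanwhile.
            block = cache.get(offset);
            if (block != null) {
                return block;
            }
            // Re-check failed: really go to disk, then populate the cache.
            block = loadFromDisk(offset);
            cache.put(offset, block);
            return block;
        }
    }

    private String loadFromDisk(long offset) {
        diskReads++;
        return "block@" + offset;
    }
}
```

          On a pure cache-hit workload every call after the first returns from the unlocked fast path, which is what removes the lock-management overhead for the read-only in-cache case.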

          Todd Lipcon made changes -
          Field Original Value New Value
          Attachment hbase-5898.txt [ 12525001 ]
          Ted Yu added a comment -

          Interesting idea.
          Minor comments:
          The indentation for while (true) loop is off.
          Changes to conf/hbase-site.xml belong to another JIRA.

          Todd Lipcon added a comment -

          this isn't meant to be a final patch - I'm just proposing the idea and put the patch up to illustrate what I meant. It needs more benchmarking in realistic scenarios to know if it's generally a good idea or not

          Ted Yu added a comment -

          Consider the case where off heap cache is enabled.
          From DoubleBlockCache:
          Suppose getBlock() is executed without the lock (first pass in the new loop of readBlock) and doesn't find cacheKey from onHeapCache but finds it in offHeapCache - it will call onHeapCache.cacheBlock():

            public Cacheable getBlock(BlockCacheKey cacheKey, boolean caching) {
              Cacheable cachedBlock;
          
              if ((cachedBlock = onHeapCache.getBlock(cacheKey, caching)) != null) {
                stats.hit(caching);
                return cachedBlock;
          
              } else if ((cachedBlock = offHeapCache.getBlock(cacheKey, caching)) != null) {
                if (caching) {
                  onHeapCache.cacheBlock(cacheKey, cachedBlock);
                }
          

          Another thread calls cacheBlock() around the same time and executes onHeapCache.cacheBlock() for the same cacheKey:

            public void cacheBlock(BlockCacheKey cacheKey, Cacheable buf) {
              onHeapCache.cacheBlock(cacheKey, buf);
              offHeapCache.cacheBlock(cacheKey, buf);
            }
          

          I think there is a race condition which didn't exist before the proposed change: the entries for the same cacheKey in onHeapCache and offHeapCache would diverge.

          If off heap cache is disabled, I don't see problem with proposed optimization.
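          The interleaving described above can be replayed deterministically with a toy two-level cache. The class and method names below are illustrative only, not the DoubleBlockCache code; the race is simulated by calling the steps in the problematic order from one thread.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Toy two-level cache; promote-on-read mirrors the getBlock promotion logic.
class TwoLevelCache {
    final Map<String, String> onHeap = new ConcurrentHashMap<>();
    final Map<String, String> offHeap = new ConcurrentHashMap<>();

    // Thread A, step 1: miss on-heap, read the (soon to be stale) off-heap value.
    String lookupOffHeap(String key) {
        String v = onHeap.get(key);
        return (v != null) ? v : offHeap.get(key);
    }

    // Thread B: caches a new block into both tiers.
    void cacheBlock(String key, String buf) {
        onHeap.put(key, buf);
        offHeap.put(key, buf);
    }

    // Thread A, step 2: promotes the value it captured earlier into on-heap.
    void promote(String key, String capturedValue) {
        onHeap.put(key, capturedValue);
    }
}
```

          Replaying the steps sequentially (thread A's off-heap read, then thread B's cacheBlock, then thread A's promotion) leaves the old value on-heap and the new value off-heap: the divergence described above.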

          Ted Yu added a comment -

          Some changes to TestBlocksRead are needed to make it pass.

          See if the increase in blocks read is acceptable.

          Ted Yu made changes -
          Attachment 5898-TestBlocksRead.txt [ 12525210 ]
          Jean-Daniel Cryans added a comment -

          Once this is fixed I think we can close HBASE-5000.

          Todd Lipcon added a comment -

          Really? I don't think this solves the same problem as HBASE-5000. This addresses the case where there's a really high cache hit ratio, whereas that one addresses the case where there's a 0% cache hit ratio.

          stack added a comment -

          @Todd Double-checked locking "...should usually be avoided." [1]

          Looking at the block cache, it looks like a block should be fully initialized before it's added to the cache, so we should avoid the horror stories detailed in the article.

          Let me try taking it for a run...

          1. http://en.wikipedia.org/wiki/Double-checked_locking#Usage_in_Java

          Todd Lipcon added a comment -

          @Todd Double-checked locking "...should usually be avoided." [1]

          I am above the law, Stack, didn't you know that?

          Seriously, though, I think the use of the idiom here is safe.

          stack added a comment -

          Marking this critical. Seems like small change w/ big win. All that is missing is a bit of exercise while under heavy read load.

          stack made changes -
          Priority Major [ 3 ] Critical [ 2 ]
          stack added a comment -

          Assigning Todd.

          stack made changes -
          Assignee Todd Lipcon [ tlipcon ]
          Todd Lipcon added a comment -

          The downside is that for a workload with a lower cache hit ratio, we'll ask the cache twice for every cache miss instead of just once. So we should see what the performance difference is for something like a 10MB cache with 1GB dataset, where the dataset fits in the OS buffer cache.

          stack added a comment -

          Thanks for the prescription Todd.

          Jean-Daniel Cryans added a comment -

          This addresses the case where there's a really high cache hit ratio, whereas that one addresses the case where there's a 0% cache hit ratio.

          Now that I actually read the patch I see how I was wrong, sorry for the noise.

          Jean-Daniel Cryans added a comment -

          The downside is that for a workload with a lower cache hit ratio, we'll ask the cache twice for every cache miss instead of just once.

          Won't this also skew the cache stats themselves?

          stack added a comment -

          j-d Can you run your blockcache checker against this? That'd probably be the best way of figuring out whether there's a problem here (do you have a way to verify you got the right block, i.e. do a read from the block?)

          Jean-Daniel Cryans added a comment -

          A not too scientific comparison.

          Before:

          total=6.26 GB, free=1.04 GB, max=7.3 GB, blocks=101753, accesses=5078311, hits=4333154, hitRatio=85.32%, cachingAccesses=5078311, cachingHits=4333154, cachingHitsRatio=85.32%, evictions=54, evicted=643404, evictedPerRun=11914.888671875

          After:

          total=6.45 GB, free=878.07 MB, max=7.3 GB, blocks=104747, accesses=5810869, hits=4345850, hitRatio=74.78%, cachingAccesses=5810869, cachingHits=4345850, cachingHitsRatio=74.78%, evictions=52, evicted=627755, evictedPerRun=12072.2119140625

          This is using the benchmark tool I'm working on right now; I had to port Todd's patch since I'm not going through the HFile code, but I'm doing exactly what he does. The test is started completely cold. It uses 5 threads.

          The big takeaway is that it did 14% more cache accesses, which had the side effect of lowering the hit ratio. The number of actual cache hits was a bit higher.

          stack added a comment -

          Can you do a test with a much lower hit ratio? Any way to compare the amount of 'work' done by the cache or how latency is affected? I'm wondering if you can use your tool to simulate the below:

          The downside is that for a workload with a lower cache hit ratio, we'll ask the cache twice for every cache miss instead of just once. So we should see what the performance difference is for something like a 10MB cache with 1GB dataset, where the dataset fits in the OS buffer cache.

          If you can't, is it something you could make your tool measure?

          Jean-Daniel Cryans added a comment -

          Can you do a test with a much lower hit ratio?

          Here it is with 100MB for a ~10GB working data set:

          Before:

          total=84.22 MB, free=15.78 MB, max=100 MB, blocks=1288, accesses=5078238, hits=3186246, hitRatio=62.74%, cachingAccesses=5078238, cachingHits=3186246, cachingHitsRatio=62.74%, evictions=11196, evicted=1890704, evictedPerRun=168.8731689453125

          real 1m12.487s
          user 8m54.470s
          sys 0m23.850s

          After:

          LRU Stats: total=84.09 MB, free=15.91 MB, max=100 MB, blocks=1286, accesses=6967721, hits=3189074, hitRatio=45.76%, cachingAccesses=6967721, cachingHits=3189074, cachingHitsRatio=45.76%, evictions=10876, evicted=1887926, evictedPerRun=173.58642578125

          real 1m12.614s
          user 9m34.990s
          sys 0m21.390s

          Again the hit ratio is much worse but in reality the hits were about the same in both cases. Included is the "time" output for those runs, the user cpu is somewhat higher (expected) but the total run time is really the same.

          Todd Lipcon added a comment -

          Shouldn't be hard to get that CPU time back – we can just add an array of cacheline-padded AtomicLongs to the cache. Whenever we add something to the cache, we do changeCounters[key.hashCode() % changeCounters.length].getAndIncrement(). Then change the code to:

          AtomicLong changeCounter = changeCounters[key.hashCode() % changeCounters.length];
          long firstTimeChangeCounter = changeCounter.get();
          first time:
            try to look up in cache
            if found: return it
          second time:
            take lock:
              if changeCounter.get() == firstTimeChangeCounter: it's not in cache
              otherwise: look up in cache again
          

          We'd probably want to cache-pad the AtomicLongs too to avoid false sharing.
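          The counter idea could look roughly like this. It is a sketch under assumptions, not the eventual patch: the class name, the stripe count of 64, and the synchronized block are all arbitrary choices here, and real cache-line padding would need filler fields or @Contended rather than plain AtomicLongs.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicLong;

// Sketch of the striped change-counter idea: skip the second cache lookup
// when nothing was inserted into this key's stripe since the first lookup.
class CountedCache {
    private final Map<String, String> cache = new ConcurrentHashMap<>();
    private final AtomicLong[] changeCounters = new AtomicLong[64];
    int secondLookups = 0; // how often the post-lock re-check actually ran

    CountedCache() {
        for (int i = 0; i < changeCounters.length; i++) {
            changeCounters[i] = new AtomicLong();
        }
    }

    private AtomicLong counterFor(String key) {
        return changeCounters[Math.floorMod(key.hashCode(), changeCounters.length)];
    }

    void put(String key, String value) {
        cache.put(key, value);
        counterFor(key).getAndIncrement(); // every insert bumps its stripe
    }

    String get(String key) {
        AtomicLong changeCounter = counterFor(key);
        long firstTime = changeCounter.get();
        String v = cache.get(key); // first, unlocked lookup
        if (v != null) {
            return v;
        }
        synchronized (this) {
            if (changeCounter.get() == firstTime) {
                return null; // stripe unchanged: definitely still a miss
            }
            secondLookups++;
            return cache.get(key); // something landed in this stripe: re-check
        }
    }
}
```

          The stripe comparison makes the common miss path pay only one lookup plus one volatile read, which is the CPU-time saving being proposed.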

          Mikhail Bautin added a comment -

          Todd: where do we increment those AtomicLong counters?

          Todd Lipcon added a comment -

          Mikhail: whenever we insert something into the cache, we'd increment the counter for that hash bucket.

          Mikhail Bautin added a comment -

          Is there any way to avoid messing up cache hit ratio with these optimizations? Maybe pass a flag to getBlock saying we should not update the metrics.

          Another suggestion from Kannan: maybe we should move metric updates and other extra logic outside of the lock, and only leave cache lookup and HDFS I/O inside the lock.
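          That suggestion amounts to the following shape (illustrative only; the class is hypothetical and the synchronized block stands in for the per-key lock): record the hit/miss outcome inside the critical section but defer the metric update until after the lock is released.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicLong;

// Illustrative: only the lookup/load stays inside the lock; the stats
// update happens after the lock is released.
class MetricsOutsideLock {
    private final Map<String, String> cache = new ConcurrentHashMap<>();
    final AtomicLong hits = new AtomicLong();
    final AtomicLong misses = new AtomicLong();

    String read(String key) {
        String block;
        boolean hit;
        synchronized (this) {            // stands in for the per-key IdLock
            block = cache.get(key);
            hit = (block != null);
            if (!hit) {
                block = "loaded:" + key; // stands in for the HDFS read
                cache.put(key, block);
            }
        }
        // Metric update outside the critical section.
        if (hit) { hits.incrementAndGet(); } else { misses.incrementAndGet(); }
        return block;
    }
}
```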

          Todd Lipcon added a comment -

          Yep, I agree we need to fix the metrics messup before this is a candidate for commit.

          I like the idea of moving the metric updates out.

          Liyin Tang made changes -
          Link This issue is related to HBASE-5987 [ HBASE-5987 ]
          Lars Hofhansl added a comment -

          Stack mentioned today that he saw a problem very similar to this.

          Elliott Clark added a comment -

          We saw this on a production cluster.

          A single data block was being requested by all ipc handlers. This caused a great deal of lock contention. Todd's patch fixed the issue and caused reads to go back to a normal duration.

          This patch is almost identical to Todd's patch. It makes the second try to get a block from the block cache not count as a miss.
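          The "don't count the re-check as a miss" behavior can be sketched with a repeat flag on the lookup. The names below mirror the idea, not necessarily the patch itself: the second, post-lock lookup passes repeat=true so its miss is not double-counted.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Sketch: the second (post-lock) lookup passes repeat=true so a miss
// there is not double-counted in the cache statistics.
class RepeatAwareCache {
    private final Map<String, String> cache = new ConcurrentHashMap<>();
    int missCount = 0;

    String getBlock(String key, boolean repeat) {
        String v = cache.get(key);
        if (v == null && !repeat) {
            missCount++; // only the first lookup records the miss
        }
        return v;
    }

    String readBlock(String key) {
        String v = getBlock(key, false);     // unlocked first check
        if (v != null) return v;
        synchronized (this) {                // stands in for the per-key lock
            v = getBlock(key, true);         // re-check: not counted as a miss
            if (v == null) {
                v = "disk:" + key;           // stands in for the HDFS read
                cache.put(key, v);
            }
            return v;
        }
    }
}
```

          With this shape a cold read does two lookups but records exactly one miss, so the hit-ratio skew J-D measured goes away.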

          Elliott Clark made changes -
          Attachment HBASE-5898-0.patch [ 12550536 ]
          Lars Hofhansl added a comment -

          I'd still be worried about checking the cache multiple times, especially when the hit ratio is low.
          On the other hand, in that case the cost of checking the cache twice is probably eclipsed by the cost of actually loading the block... So maybe it's nothing to worry about.

          Elliott Clark added a comment -

          Since it's possible that the thread was blocked behind a reader that went to HDFS for the block, I think checking the cache again is most likely to be a win.

          I have a demo app that hits one block multi-threaded. I'll try and get some concrete numbers up to back up my thoughts.

          Lars Hofhansl added a comment -

          That's a good theory.
          I guess the worst case here is the uncontended case with lots of cache misses.
          For example, a single-threaded large scan over data not in the block cache; in that case we'll always needlessly go around twice in that loop.

          Lars Hofhansl made changes -
          Fix Version/s 0.94.3 [ 12323144 ]
          Fix Version/s 0.96.0 [ 12320040 ]
          Lars Hofhansl added a comment -

          Instead of checking blockReadTryCount != 2 we could also check useLock, right? It should be true exactly when blockReadTryCount != 2.

          Lars Hofhansl added a comment -

          I meant it is false when blockReadTryCount != 2

          Lars Hofhansl added a comment -

          Otherwise looks good. I think the double cache check is fine.

          Elliott Clark added a comment -

          Good catch on that. Here's a patch without the counting.

          Elliott Clark made changes -
          Attachment HBASE-5898-1.patch [ 12550877 ]
          Elliott Clark added a comment -

          patch without the counter.

          Elliott Clark made changes -
          Status Open [ 1 ] Patch Available [ 10002 ]
          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12550877/HBASE-5898-1.patch
          against trunk revision .

          +1 @author. The patch does not contain any @author tags.

          -1 tests included. The patch doesn't appear to include any new or modified tests.
          Please justify why no new tests are needed for this patch.
          Also please list what manual steps were performed to verify this patch.

          +1 hadoop2.0. The patch compiles against the hadoop 2.0 profile.

          -1 javadoc. The javadoc tool appears to have generated 85 warning messages.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          -1 findbugs. The patch appears to introduce 3 new Findbugs (version 1.3.9) warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          -1 core tests. The patch failed these unit tests:
          org.apache.hadoop.hbase.regionserver.TestStoreFile

          Test results: https://builds.apache.org/job/PreCommit-HBASE-Build/3152//testReport/
          Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/3152//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-hadoop2-compat.html
          Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/3152//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-hadoop1-compat.html
          Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/3152//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-common.html
          Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/3152//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-server.html
          Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/3152//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-hadoop-compat.html
          Console output: https://builds.apache.org/job/PreCommit-HBASE-Build/3152//console

          This message is automatically generated.

          Lars Hofhansl added a comment -

          Test failure is suspicious

          ramkrishna.s.vasudevan added a comment -

          We have recently hit this issue. My major concern: there should only be lock contention happening here, right? But in our case the scan itself did not happen for almost 10 minutes.
          The thread dump clearly matches what was found in this JIRA.

          stack added a comment -

          ramkrishna.s.vasudevan How do you mean, Ram? Where in particular was it stuck? Was it a bunch of threads getting the same block? What did the thread dump look like? There seems to be some issue here around the wait/notify as implemented. The double-checked locking is probably better anyway, but could the issue come back, just less frequently, after this patch goes in?

          ramkrishna.s.vasudevan added a comment -

          I can attach some parts of the thread dump

          Was it a bunch of threads getting same block?

          Yes

          The double-checked is probably better anyways but could the issue come back just less frequently after this patch goes in?

          Am not sure. We tried to restart the client twice still this persisted. Later the RS we restarted after that we could not get this.
          This thing repeats many times. We took 3 thread dumps in a span of 2 mins

          "IPC Server handler 42 on 60020" daemon prio=10 tid=0x00007f2f38f1a000 nid=0x6c4d in Object.wait() [0x00007f2f33e4f000]
             java.lang.Thread.State: WAITING (on object monitor)
          	at java.lang.Object.wait(Native Method)
          	at java.lang.Object.wait(Object.java:485)
          	at org.apache.hadoop.hbase.util.IdLock.getLockEntry(IdLock.java:77)
          	- locked <0x00000006cc2a7178> (a org.apache.hadoop.hbase.util.IdLock$Entry)
          	at org.apache.hadoop.hbase.io.hfile.HFileReaderV2.readBlock(HFileReaderV2.java:290)
          	at org.apache.hadoop.hbase.io.hfile.HFileBlockIndex$BlockIndexReader.seekToDataBlock(HFileBlockIndex.java:213)
          	at org.apache.hadoop.hbase.io.hfile.HFileReaderV2$AbstractScannerV2.seekTo(HFileReaderV2.java:455)
          	at org.apache.hadoop.hbase.io.hfile.HFileReaderV2$AbstractScannerV2.reseekTo(HFileReaderV2.java:493)
          	at org.apache.hadoop.hbase.regionserver.StoreFileScanner.reseekAtOrAfter(StoreFileScanner.java:242)
          	at org.apache.hadoop.hbase.regionserver.StoreFileScanner.reseek(StoreFileScanner.java:167)
          	at org.apache.hadoop.hbase.regionserver.NonLazyKeyValueScanner.doRealSeek(NonLazyKeyValueScanner.java:54)
          	at org.apache.hadoop.hbase.regionserver.KeyValueHeap.generalizedSeek(KeyValueHeap.java:299)
          	at org.apache.hadoop.hbase.regionserver.KeyValueHeap.reseek(KeyValueHeap.java:244)
          	at org.apache.hadoop.hbase.regionserver.StoreScanner.reseek(StoreScanner.java:523)
          	- locked <0x000000069a665420> (a org.apache.hadoop.hbase.regionserver.StoreScanner)
          	at org.apache.hadoop.hbase.regionserver.StoreScanner.next(StoreScanner.java:399)
          	- locked <0x000000069a665420> (a org.apache.hadoop.hbase.regionserver.StoreScanner)
          	at org.apache.hadoop.hbase.regionserver.KeyValueHeap.next(KeyValueHeap.java:127)
          	at org.apache.hadoop.hbase.regionserver.HRegion$RegionScannerImpl.nextInternal(HRegion.java:3424)
          	at org.apache.hadoop.hbase.regionserver.HRegion$RegionScannerImpl.next(HRegion.java:3379)
          	- locked <0x000000069a7da458> (a org.apache.hadoop.hbase.regionserver.HRegion$RegionScannerImpl)
          	at org.apache.hadoop.hbase.regionserver.HRegion$RegionScannerImpl.next(HRegion.java:3396)
          	- locked <0x000000069a7da458> (a org.apache.hadoop.hbase.regionserver.HRegion$RegionScannerImpl)
          	at org.apache.hadoop.hbase.regionserver.HRegionServer.next(HRegionServer.java:2411)
          	at sun.reflect.GeneratedMethodAccessor32.invoke(Unknown Source)
          	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
          

          We could also see that releaseLockEntry was sometimes in progress, but in the 3 thread dumps this appeared only once.

          "IPC Server handler 18 on 60020" daemon prio=10 tid=0x00007f2f38ee9800 nid=0x6c35 runnable [0x00007f2f35667000]
             java.lang.Thread.State: RUNNABLE
          	at java.lang.Object.notify(Native Method)
          	at org.apache.hadoop.hbase.util.IdLock.releaseLockEntry(IdLock.java:108)
          	- locked <0x00000006cc2a7178> (a org.apache.hadoop.hbase.util.IdLock$Entry)
          	at org.apache.hadoop.hbase.io.hfile.HFileReaderV2.readBlock(HFileReaderV2.java:352)
          	at org.apache.hadoop.hbase.io.hfile.HFileBlockIndex$BlockIndexReader.seekToDataBlock(HFileBlockIndex.java:213)
          	at org.apache.hadoop.hbase.io.hfile.HFileReaderV2$AbstractScannerV2.seekTo(HFileReaderV2.java:455)
          	at org.apache.hadoop.hbase.io.hfile.HFileReaderV2$AbstractScannerV2.reseekTo(HFileReaderV2.java:493)
          	at org.apache.hadoop.hbase.regionserver.StoreFileScanner.reseekAtOrAfter(StoreFileScanner.java:242)
          	at org.apache.hadoop.hbase.regionserver.StoreFileScanner.reseek(StoreFileScanner.java:167)
          	at org.apache.hadoop.hbase.regionserver.NonLazyKeyValueScanner.doRealSeek(NonLazyKeyValueScanner.java:54)
          	at org.apache.hadoop.hbase.regionserver.KeyValueHeap.generalizedSeek(KeyValueHeap.java:299)
          	at org.apache.hadoop.hbase.regionserver.KeyValueHeap.reseek(KeyValueHeap.java:244)
          	at org.apache.hadoop.hbase.regionserver.StoreScanner.reseek(StoreScanner.java:523)
          	- locked <0x000000069a89d678> (a org.apache.hadoop.hbase.regionserver.StoreScanner)
          	at org.apache.hadoop.hbase.regionserver.StoreScanner.next(StoreScanner.java:399)
          	- locked <0x000000069a89d678> (a org.apache.hadoop.hbase.regionserver.StoreScanner)
          	at org.apache.hadoop.hbase.regionserver.KeyValueHeap.next(KeyValueHeap.java:127)
          	at org.apache.hadoop.hbase.regionserver.HRegion$RegionScannerImpl.nextInternal(HRegion.java:3424)
          	at org.apache.hadoop.hbase.regionserver.HRegion$RegionScannerImpl.next(HRegion.java:3379)
          	- locked <0x000000069ae0bbb8> (a org.apache.hadoop.hbase.regionserver.HRegion$RegionScannerImpl)
          	at org.apache.hadoop.hbase.regionserver.HRegion$RegionScannerImpl.next(HRegion.java:3396)
          	- locked <0x000000069ae0bbb8> (a org.apache.hadoop.hbase.regionserver.HRegion$RegionScannerImpl)
          	at org.apache.hadoop.hbase.regionserver.HRegionServer.next(HRegionServer.java:2411)
          
          

          All the client threads were hanging here

          "Thread-29" prio=10 tid=0x00007f9f2c549000 nid=0x639f waiting for monitor entry [0x00007f9f2adec000]
             java.lang.Thread.State: BLOCKED (on object monitor)
          	at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.locateRegionInMeta(HConnectionManager.java:955)
          	- waiting to lock <0x000000078ba82828> (a java.lang.Object)
          	at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.locateRegion(HConnectionManager.java:841)
          	at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.locateRegion(HConnectionManager.java:810)
          	at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.locateRegionInMeta(HConnectionManager.java:942)
          	at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.locateRegion(HConnectionManager.java:845)
          	at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.locateRegion(HConnectionManager.java:810)
          	at org.apache.hadoop.hbase.client.HTable.finishSetup(HTable.java:232)
          	at org.apache.hadoop.hbase.client.HTable.<init>(HTable.java:172)
          	at org.apache.hadoop.hbase.client.HTable.<init>(HTable.java:131)
          
          
          stack added a comment -

          I've seen that thread dump before! Elliott Clark has a program to try and repro the above and hopefully we can add some instrumentation and get clues on why the above happens.

          Lars Hofhansl added a comment -

          Sounds like a missed notify or a deadlock.
          Although looking at the code I do not see how that can happen. The use of notify (vs. notifyAll) seems correct in IdLock since all waiting threads wait for the same condition and only one thread will be able to proceed.

          @Ram: Which version of HBase?
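
          The notify-vs-notifyAll reasoning above can be illustrated with a minimal, hypothetical per-id lock sketch. The names SimpleIdLock, getLockEntry, and releaseLockEntry echo the real class, but this is not the actual org.apache.hadoop.hbase.util.IdLock source: each id gets at most one Entry in a concurrent map, the holder owns the lock, waiters block on the Entry's monitor, and notify() wakes exactly one of them.

          ```java
          import java.util.concurrent.ConcurrentHashMap;

          // Hypothetical sketch of a per-id lock in the style of IdLock.
          // Not the actual HBase implementation.
          class SimpleIdLock {
            static class Entry {
              final long id;
              int numWaiters;        // threads currently blocked on this entry
              boolean locked = true; // the creator starts out holding the lock
              Entry(long id) { this.id = id; }
            }

            private final ConcurrentHashMap<Long, Entry> map = new ConcurrentHashMap<>();

            Entry getLockEntry(long id) throws InterruptedException {
              Entry entry = new Entry(id);
              while (true) {
                Entry existing = map.putIfAbsent(id, entry);
                if (existing == null) {
                  return entry; // no contention: we now hold the lock for this id
                }
                synchronized (existing) {
                  if (map.get(id) != existing) {
                    continue; // entry was released and removed; retry with a fresh one
                  }
                  while (existing.locked) {
                    existing.numWaiters++;
                    try {
                      existing.wait(); // woken by notify() in releaseLockEntry
                    } finally {
                      existing.numWaiters--;
                    }
                  }
                  existing.locked = true; // we are the single notified waiter
                  return existing;
                }
              }
            }

            void releaseLockEntry(Entry entry) {
              synchronized (entry) {
                entry.locked = false;
                if (entry.numWaiters > 0) {
                  entry.notify(); // only one waiter can take the lock, so notify() suffices
                } else {
                  map.remove(entry.id); // nobody waiting: drop the entry entirely
                }
              }
            }
          }
          ```

          In this sketch every waiter on an Entry waits for the same condition (locked becoming false) and exactly one can proceed, which matches the argument that notify() rather than notifyAll() is correct here.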

          Lars Hofhansl added a comment -

          Clearly this can happen when HDFS has a problem. One thread tries to load the block, and if that is delayed due to HDFS, all other threads need to queue up and wait.

          Lars Hofhansl added a comment -

          I think we should commit this fix (after checking out the TestStoreFile failure) and investigate Ram's issue in a different jira, these look like two different scenarios.

          stack added a comment -

          Lars Hofhansl When you say different scenarios, are you thinking a) Todd saw contention on this lock and made a workaround – this is what this issue addresses – while the other scenario, b), is what JD and a few of us saw that looked like weird jvm-bug-like slowness in this same area, and this patch works around that issue (Ram saw slowness too)? I'm a little worried that after this patch goes in, b) will still happen but will just be harder to trip. It would be great if we could reproduce it and poke at it.

          Lars Hofhansl added a comment -

          Something like this. I don't think the patch works around or even affects the second issue.

          The most likely explanation for the second issue seems to be HDFS slowness:
          The block is not in the cache, the first thread tries to load it, and while that is happening all other threads have to (and should) wait.
          If there is a temporary network hiccup, that will take a bit... and it would look exactly like these stack traces, where many threads are queued up behind this lock.

          Now that you say it, though... on further consideration I am not sure I buy that there even is a contention issue here. We either:

          • have the block in the cache, in which case we'll return it very quickly.
          • do not have the block, in that case we have to load it and all other threads must wait.
          stack added a comment -

          Onus is on us to show you the thread dumps where no thread was going to hdfs.... let me see if can dig up one.

          ramkrishna.s.vasudevan added a comment -

          @Ram: Which version of HBase?

          0.94.0 plus a few more patches on top of it.
          Going through the code in getLockEntry and releaseLockEntry, things seem to be pretty much fine there.
          Regarding whether this could be an HDFS issue:
          if that were the case, the thread dump should show the code waiting on HDFS, right? Somewhere in fsReader.readBlockData() or a similar area where we try to read from HDFS.

          Lars Hofhansl added a comment -

          What should we do with this one?

          @Ram: you're probably right; on the other hand I cannot see anything wrong with this code (and I have looked at it multiple times). The proposed patch will probably not fix the problem you saw.

          And from an earlier comment it is doubtful that this patch is even an improvement for the issue that Todd saw. I'd say we should table this (at least for 0.94) until we know what exactly the issue is.

          Jean-Daniel Cryans added a comment -

          The most likely explanation for the 2nd issue seems to be HDFS slowness

          Improbable. With the patch, the first thread would try to get the block and get stuck; the others would first try to read without the lock, fail, and then would also block. Except that when I applied the patch it resolved the issue.
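
          The two-step flow described above can be sketched as follows. This is a hypothetical stand-in, not the actual HFileReaderV2.readBlock code: the cache, the key type, and the loadBlockFromDisk method are illustrative, and the real code takes a per-block IdLock rather than the single lock used here to keep the sketch short.

          ```java
          import java.util.concurrent.ConcurrentHashMap;
          import java.util.concurrent.locks.ReentrantLock;

          // Hypothetical sketch of double-checked locking around a block cache.
          class DoubleCheckedBlockCache {
            private final ConcurrentHashMap<String, byte[]> cache = new ConcurrentHashMap<>();
            // The real code uses a per-block IdLock; one lock keeps this sketch short.
            private final ReentrantLock loadLock = new ReentrantLock();

            byte[] readBlock(String blockKey) {
              // First check, without the lock: on a cache hit this is the entire cost.
              byte[] block = cache.get(blockKey);
              if (block != null) {
                return block;
              }
              loadLock.lock();
              try {
                // Second check, under the lock: another thread may have loaded the
                // block while we were waiting.
                block = cache.get(blockKey);
                if (block != null) {
                  return block;
                }
                // Cache miss confirmed: do the slow read and publish the result.
                block = loadBlockFromDisk(blockKey);
                cache.put(blockKey, block);
                return block;
              } finally {
                loadLock.unlock();
              }
            }

            // Stand-in for the actual HDFS read.
            private byte[] loadBlockFromDisk(String blockKey) {
              return new byte[] {1, 2, 3};
            }
          }
          ```

          With this shape, a slow disk read still serializes the threads that miss, exactly as in the stack traces above, while a fully cached workload never touches the lock at all.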

          Lars Hofhansl added a comment -

          You wanna just commit it? It won't do any harm. I'm just doubtful that it will fix anything.

          ramkrishna.s.vasudevan added a comment -

          I think we can commit this. Surely this will improve performance.
          Just before committing, can we check whether Ted's comment
          https://issues.apache.org/jira/browse/HBASE-5898?focusedCommentId=13264470&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13264470 could be a problem if the off-heap cache is enabled.

          Lars Hofhansl added a comment -

          Off topic: is the off-heap cache even working currently? If not, I'd be in favor of ripping that code out.

          stack added a comment -

          The off-heap cache is being used by no one – it is unusable in its current form – and it is doubtful anyone would ever use it, or that it even makes sense to use it as currently implemented, instead of just having a bigger block cache with decent GC tuning plus the file system cache. I'd be fine with ripping it out until someone wants to take it up and finish it off.

          stack added a comment -

          Elliott Clark Elliott, did you get any more data on the hung-up case? (Elliott has been trying to repro it so we can get more data on the slow wait/notifying.)

          Lars Hofhansl added a comment -

          Yeah, off-heap caching for disk blocks always seemed a bit "interesting"; since the blocks all have the same size, the GC should be able to deal with the churn nicely.

          Lars Hofhansl added a comment -

          I wonder whether HBASE-6032 helps with this.

          Lars Hofhansl added a comment -

          Gentlemen, what should we do with this?
          I would like the next 0.94RC soon. Do we want this now, or can it wait a few weeks for the next RC?

          Jean-Daniel Cryans added a comment -

          If we want to get people testing it, I say we include it right now; otherwise not at all for 0.94.

          Lars Hofhansl added a comment -

          Let's commit it, and add to the release notes that this potentially breaks off-heap caching.

          Lars Hofhansl added a comment -

          OK. Somebody who wants this committed should do so. I am +0.

          Lars Hofhansl added a comment -

          If nobody is willing to sponsor this (by committing), maybe it should not be committed...?

          stack added a comment -

          This patch was the difference between a stalling, sick cluster and one that worked properly again. I think it has to be committed (the off-heap cache should not be in the way of committing this issue).

          Lars Hofhansl added a comment -

          Wanna commit Stack?

          Lars Hofhansl added a comment -

          In light of HBASE-6852, even just avoiding the lock when we already have the block cached might lead to a nice performance improvement.

          stack added a comment -

          Retry

          stack made changes -
          Attachment HBASE-5898-1.patch [ 12552544 ]
          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12552544/HBASE-5898-1.patch
          against trunk revision .

          +1 @author. The patch does not contain any @author tags.

          -1 tests included. The patch doesn't appear to include any new or modified tests.
          Please justify why no new tests are needed for this patch.
          Also please list what manual steps were performed to verify this patch.

          -1 patch. The patch command could not apply the patch.

          Console output: https://builds.apache.org/job/PreCommit-HBASE-Build/3258//console

          This message is automatically generated.

          Elliott Clark added a comment -

          Looks like the patch has gone stale after the big metrics2 commit. I'll get a new version up for trunk in a little bit.

          Lars Hofhansl added a comment -

          Rebased trunk patch. (want to get 0.94.3 RC out)

          Lars Hofhansl made changes -
          Attachment 5898-v2.txt [ 12553046 ]
          Lars Hofhansl added a comment -

          Oops. No, not quite the same.

          Lars Hofhansl made changes -
          Status Patch Available [ 10002 ] Open [ 1 ]
          Lars Hofhansl added a comment -

          Arggh... NM. This is the right patch.

          Lars Hofhansl made changes -
          Status Open [ 1 ] Patch Available [ 10002 ]
          Lars Hofhansl added a comment -

          Figured out my confusion. I took the patch from the wrong directory. This version is good now.

          Lars Hofhansl made changes -
          Attachment 5898-v2.txt [ 12553047 ]
          Lars Hofhansl added a comment -

          I can't make patches.

          Lars Hofhansl made changes -
          Status Patch Available [ 10002 ] Open [ 1 ]
          Lars Hofhansl made changes -
          Attachment 5898-v2.txt [ 12553047 ]
          Lars Hofhansl made changes -
          Attachment 5898-v2.txt [ 12553046 ]
          Lars Hofhansl added a comment -

          Last try.

          Lars Hofhansl made changes -
          Attachment 5898-v2.txt [ 12553048 ]
          Lars Hofhansl made changes -
          Status Open [ 1 ] Patch Available [ 10002 ]
          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12553048/5898-v2.txt
          against trunk revision .

          +1 @author. The patch does not contain any @author tags.

          -1 tests included. The patch doesn't appear to include any new or modified tests.
          Please justify why no new tests are needed for this patch.
          Also please list what manual steps were performed to verify this patch.

          +1 hadoop2.0. The patch compiles against the hadoop 2.0 profile.

          -1 javadoc. The javadoc tool appears to have generated 87 warning messages.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          -1 findbugs. The patch appears to introduce 17 new Findbugs (version 1.3.9) warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          -1 core tests. The patch failed these unit tests:
          org.apache.hadoop.hbase.regionserver.TestStoreFile

          Test results: https://builds.apache.org/job/PreCommit-HBASE-Build/3307//testReport/
          Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/3307//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-hadoop2-compat.html
          Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/3307//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-examples.html
          Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/3307//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-server.html
          Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/3307//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-hadoop1-compat.html
          Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/3307//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-common.html
          Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/3307//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-hadoop-compat.html
          Console output: https://builds.apache.org/job/PreCommit-HBASE-Build/3307//console

          This message is automatically generated.

          Lars Hofhansl added a comment -

          TestStoreFile fails because it checks CacheStats.getMissCount(), which will be wrong even after the modified patch. CacheStats.miss updates two counters: missCount and missCachingCount; only missCachingCount is counted correctly.
          So we can either change the test to accept the wrong number for missCount, or change the CacheStats API.

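          The over-count is easy to see in miniature. Below is a hypothetical sketch, not the actual CacheStats API: it assumes miss() bumps missCount unconditionally and missCachingCount only when the caller intended to cache the block, which is one plausible reading of why only missCachingCount stays correct under double-checked locking.

```java
import java.util.concurrent.atomic.AtomicLong;

// Illustrative stand-in for the 0.94-era CacheStats; names are hypothetical.
class CacheStatsSketch {
    private final AtomicLong missCount = new AtomicLong();
    private final AtomicLong missCachingCount = new AtomicLong();

    // Called on every failed cache lookup. missCount always moves;
    // missCachingCount only when the caller wanted to cache the block.
    void miss(boolean caching) {
        missCount.incrementAndGet();
        if (caching) {
            missCachingCount.incrementAndGet();
        }
    }

    long getMissCount()        { return missCount.get(); }
    long getMissCachingCount() { return missCachingCount.get(); }
}
```

          With double-checked locking a single real miss produces two failed lookups — one without the lock and one under it — so miss() fires twice and getMissCount() reports 2. If only the first lookup counts as a caching lookup, getMissCachingCount() still reports the correct 1, which is why switching the test to getMissCachingCount() makes it pass.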
          Lars Hofhansl added a comment -

          Version that fixes TestStoreFile (by using getMissCachingCount() instead of getMissCount()).

          The missCount is still over-counted.

          Lars Hofhansl made changes -
          Attachment 5898-v3.txt [ 12553057 ]
          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12553057/5898-v3.txt
          against trunk revision .

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 3 new or modified tests.

          +1 hadoop2.0. The patch compiles against the hadoop 2.0 profile.

          -1 javadoc. The javadoc tool appears to have generated 87 warning messages.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          -1 findbugs. The patch appears to introduce 17 new Findbugs (version 1.3.9) warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          -1 core tests. The patch failed these unit tests:
          org.apache.hadoop.hbase.client.TestFromClientSide
          org.apache.hadoop.hbase.client.TestFromClientSideWithCoprocessor

          Test results: https://builds.apache.org/job/PreCommit-HBASE-Build/3308//testReport/
          Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/3308//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-hadoop2-compat.html
          Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/3308//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-examples.html
          Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/3308//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-server.html
          Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/3308//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-hadoop1-compat.html
          Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/3308//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-common.html
          Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/3308//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-hadoop-compat.html
          Console output: https://builds.apache.org/job/PreCommit-HBASE-Build/3308//console

          This message is automatically generated.

          stack added a comment -

          Seems fine by me Lars.

          Lars Hofhansl added a comment -

          OK... Will commit in the next hour.

          Lars Hofhansl added a comment -

          Actually... Which metric is measured via JMX? It would be bad if that metric were double-reported via JMX.
          This is the last jira open against 0.94.3.

          Lars Hofhansl added a comment -

          Yeah, just verified that RegionServerMetrics.blockCacheMissCount is driven by CacheStats.missCount, which will be double counted with this patch.

          Lars Hofhansl added a comment -

          BTW, it also turns out that HFileReaderVx updates an AtomicLong (cacheHits) that is not read anywhere...?! I'll remove that.

          Lars Hofhansl added a comment -

          Same for metaLoads and blockLoads. Both are AtomicLongs updated for no reason.

          Lars Hofhansl added a comment -

          Patch that handles the cache misses correctly. It also removes the unnecessary AtomicLong updates.

          Not pretty... BlockCache.getBlock has another boolean parameter to indicate whether cache misses should be counted... All callers and implementers needed to be changed.

          Please let me know what you think.

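          The shape of the change — a lock-free first lookup, then a locked re-check that passes a flag so the repeated lookup does not count a second miss — can be sketched as follows. Everything here is an illustrative stand-in, not the real HBase code: BlockCacheSketch, the String-keyed map, and loadFromDisk are hypothetical, and synchronizing on the interned key merely stands in for the per-block IdLock.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicLong;

class BlockCacheSketch {
    private final ConcurrentHashMap<String, byte[]> cache = new ConcurrentHashMap<>();
    final AtomicLong missCount = new AtomicLong();

    // repeat=true marks the second, locked lookup of the same block,
    // so a single real miss is only counted once.
    byte[] getBlock(String key, boolean repeat) {
        byte[] block = cache.get(key);
        if (block == null && !repeat) {
            missCount.incrementAndGet();
        }
        return block;
    }

    byte[] readBlock(String key) {
        // First check: lock-free fast path for cache hits.
        byte[] block = getBlock(key, false);
        if (block != null) {
            return block;
        }
        // Slow path: take the (per-block) lock and re-check, since another
        // thread may have loaded the block while we waited for the lock.
        synchronized (key.intern()) {
            block = getBlock(key, true); // repeat lookup: no second miss
            if (block != null) {
                return block;
            }
            block = loadFromDisk(key);
            cache.put(key, block);
            return block;
        }
    }

    private byte[] loadFromDisk(String key) {
        return key.getBytes(); // placeholder for the actual HFile read
    }
}
```

          The extra unlocked check makes cache hits skip lock management entirely, while a real miss pays one additional (cheap) map lookup before the (expensive) disk read — the trade-off described in the issue.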
          Lars Hofhansl made changes -
          Attachment 5898-v4.txt [ 12553215 ]
          Lars Hofhansl added a comment -

          Any comments? Would like to commit and do a 0.94.3rc.

          Lars Hofhansl made changes -
          Status Patch Available [ 10002 ] Open [ 1 ]
          Lars Hofhansl made changes -
          Status Open [ 1 ] Patch Available [ 10002 ]
          stack added a comment -

          The repeat is a bit strange, but it is at least easy to follow what is going on in this block-getting code, so +1 on commit for 0.94. No harm in beefing up this comment on commit:

          +   * @param repeat Whether this is a repeat lookup for the same block
          +   *        {@see HFileReaderV2#readBlock(long, long, boolean, boolean, boolean, BlockType)}
          

          You refer to the code responsible for this 'repeat' param, but maybe mention not wanting to double-count metrics when doing double-checked locking in the cited code?

          Good on you Lars.

          Lars Hofhansl added a comment -

          Will update the comment. I would like to get a HadoopQA run through, but it's currently broken. (possibly related to HBASE-7104)

          Lars Hofhansl added a comment -

          Attaching same patch again for HadoopQA

          Lars Hofhansl made changes -
          Attachment 5898-v4.txt [ 12553266 ]
          Lars Hofhansl made changes -
          Status Patch Available [ 10002 ] Open [ 1 ]
          Lars Hofhansl made changes -
          Status Open [ 1 ] Patch Available [ 10002 ]
          Lars Hofhansl added a comment -

          Triggered HadoopQA directly through jenkins.

          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12553266/5898-v4.txt
          against trunk revision .

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 15 new or modified tests.

          +1 hadoop2.0. The patch compiles against the hadoop 2.0 profile.

          -1 javadoc. The javadoc tool appears to have generated 93 warning messages.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          -1 findbugs. The patch appears to introduce 17 new Findbugs (version 1.3.9) warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          -1 core tests. The patch failed these unit tests:
          org.apache.hadoop.hbase.client.TestShell

          Test results: https://builds.apache.org/job/PreCommit-HBASE-Build/3320//testReport/
          Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/3320//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-hadoop2-compat.html
          Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/3320//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-examples.html
          Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/3320//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-server.html
          Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/3320//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-hadoop1-compat.html
          Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/3320//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-common.html
          Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/3320//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-hadoop-compat.html
          Console output: https://builds.apache.org/job/PreCommit-HBASE-Build/3320//console

          This message is automatically generated.

          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12553266/5898-v4.txt
          against trunk revision .

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 15 new or modified tests.

          +1 hadoop2.0. The patch compiles against the hadoop 2.0 profile.

          -1 javadoc. The javadoc tool appears to have generated 93 warning messages.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          -1 findbugs. The patch appears to introduce 17 new Findbugs (version 1.3.9) warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          -1 core tests. The patch failed these unit tests:
          org.apache.hadoop.hbase.client.TestShell
          org.apache.hadoop.hbase.TestDrainingServer

          Test results: https://builds.apache.org/job/PreCommit-HBASE-Build/3321//testReport/
          Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/3321//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-hadoop2-compat.html
          Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/3321//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-examples.html
          Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/3321//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-hadoop1-compat.html
          Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/3321//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-common.html
          Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/3321//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-server.html
          Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/3321//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-hadoop-compat.html
          Console output: https://builds.apache.org/job/PreCommit-HBASE-Build/3321//console

          This message is automatically generated.

          Lars Hofhansl added a comment -

          TestDrainingServer passes locally. TestShell fails with or without this patch.

          Lars Hofhansl made changes -
          Status Patch Available [ 10002 ] Open [ 1 ]
          Lars Hofhansl added a comment -

          0.94 version of the patch

          Lars Hofhansl made changes -
          Attachment 5898-0.94.txt [ 12553277 ]
          Lars Hofhansl added a comment -

          Committed to 0.94 and 0.96.
          (Removal of those unneeded AtomicLongs will give some additional cycles back)

          Lars Hofhansl made changes -
          Status Open [ 1 ] Resolved [ 5 ]
          Resolution Fixed [ 1 ]
          Hudson added a comment -

          Integrated in HBase-0.94 #582 (See https://builds.apache.org/job/HBase-0.94/582/)
          HBASE-5898 Consider double-checked locking for block cache lock (Todd, Elliot, LarsH) (Revision 1408621)

          Result = FAILURE
          larsh :
          Files :

          • /hbase/branches/0.94/src/main/java/org/apache/hadoop/hbase/io/hfile/AbstractHFileReader.java
          • /hbase/branches/0.94/src/main/java/org/apache/hadoop/hbase/io/hfile/BlockCache.java
          • /hbase/branches/0.94/src/main/java/org/apache/hadoop/hbase/io/hfile/DoubleBlockCache.java
          • /hbase/branches/0.94/src/main/java/org/apache/hadoop/hbase/io/hfile/HFileReaderV1.java
          • /hbase/branches/0.94/src/main/java/org/apache/hadoop/hbase/io/hfile/HFileReaderV2.java
          • /hbase/branches/0.94/src/main/java/org/apache/hadoop/hbase/io/hfile/LruBlockCache.java
          • /hbase/branches/0.94/src/main/java/org/apache/hadoop/hbase/io/hfile/SimpleBlockCache.java
          • /hbase/branches/0.94/src/main/java/org/apache/hadoop/hbase/io/hfile/slab/SingleSizeCache.java
          • /hbase/branches/0.94/src/main/java/org/apache/hadoop/hbase/io/hfile/slab/SlabCache.java
          • /hbase/branches/0.94/src/test/java/org/apache/hadoop/hbase/io/hfile/CacheTestUtils.java
          • /hbase/branches/0.94/src/test/java/org/apache/hadoop/hbase/io/hfile/TestCacheOnWrite.java
          • /hbase/branches/0.94/src/test/java/org/apache/hadoop/hbase/io/hfile/TestHFileDataBlockEncoder.java
          • /hbase/branches/0.94/src/test/java/org/apache/hadoop/hbase/io/hfile/TestLruBlockCache.java
          • /hbase/branches/0.94/src/test/java/org/apache/hadoop/hbase/regionserver/TestCacheOnWriteInSchema.java
          Hudson added a comment -

          Integrated in HBase-TRUNK #3534 (See https://builds.apache.org/job/HBase-TRUNK/3534/)
          HBASE-5898 Consider double-checked locking for block cache lock (Todd, Elliot, LarsH) (Revision 1408620)

          Result = FAILURE
          larsh :
          Files :

          • /hbase/trunk/hbase-server/src/main/java/org/apache/hadoop/hbase/io/hfile/AbstractHFileReader.java
          • /hbase/trunk/hbase-server/src/main/java/org/apache/hadoop/hbase/io/hfile/BlockCache.java
          • /hbase/trunk/hbase-server/src/main/java/org/apache/hadoop/hbase/io/hfile/DoubleBlockCache.java
          • /hbase/trunk/hbase-server/src/main/java/org/apache/hadoop/hbase/io/hfile/HFileReaderV1.java
          • /hbase/trunk/hbase-server/src/main/java/org/apache/hadoop/hbase/io/hfile/HFileReaderV2.java
          • /hbase/trunk/hbase-server/src/main/java/org/apache/hadoop/hbase/io/hfile/LruBlockCache.java
          • /hbase/trunk/hbase-server/src/main/java/org/apache/hadoop/hbase/io/hfile/SimpleBlockCache.java
          • /hbase/trunk/hbase-server/src/main/java/org/apache/hadoop/hbase/io/hfile/slab/SingleSizeCache.java
          • /hbase/trunk/hbase-server/src/main/java/org/apache/hadoop/hbase/io/hfile/slab/SlabCache.java
          • /hbase/trunk/hbase-server/src/test/java/org/apache/hadoop/hbase/io/hfile/CacheTestUtils.java
          • /hbase/trunk/hbase-server/src/test/java/org/apache/hadoop/hbase/io/hfile/TestCacheOnWrite.java
          • /hbase/trunk/hbase-server/src/test/java/org/apache/hadoop/hbase/io/hfile/TestHFileDataBlockEncoder.java
          • /hbase/trunk/hbase-server/src/test/java/org/apache/hadoop/hbase/io/hfile/TestLruBlockCache.java
          • /hbase/trunk/hbase-server/src/test/java/org/apache/hadoop/hbase/regionserver/TestCacheOnWriteInSchema.java
          Hudson added a comment -

          Integrated in HBase-0.94-security #83 (See https://builds.apache.org/job/HBase-0.94-security/83/)
          HBASE-5898 Consider double-checked locking for block cache lock (Todd, Elliot, LarsH) (Revision 1408621)

          Result = SUCCESS
          larsh :
          Files :

          • /hbase/branches/0.94/src/main/java/org/apache/hadoop/hbase/io/hfile/AbstractHFileReader.java
          • /hbase/branches/0.94/src/main/java/org/apache/hadoop/hbase/io/hfile/BlockCache.java
          • /hbase/branches/0.94/src/main/java/org/apache/hadoop/hbase/io/hfile/DoubleBlockCache.java
          • /hbase/branches/0.94/src/main/java/org/apache/hadoop/hbase/io/hfile/HFileReaderV1.java
          • /hbase/branches/0.94/src/main/java/org/apache/hadoop/hbase/io/hfile/HFileReaderV2.java
          • /hbase/branches/0.94/src/main/java/org/apache/hadoop/hbase/io/hfile/LruBlockCache.java
          • /hbase/branches/0.94/src/main/java/org/apache/hadoop/hbase/io/hfile/SimpleBlockCache.java
          • /hbase/branches/0.94/src/main/java/org/apache/hadoop/hbase/io/hfile/slab/SingleSizeCache.java
          • /hbase/branches/0.94/src/main/java/org/apache/hadoop/hbase/io/hfile/slab/SlabCache.java
          • /hbase/branches/0.94/src/test/java/org/apache/hadoop/hbase/io/hfile/CacheTestUtils.java
          • /hbase/branches/0.94/src/test/java/org/apache/hadoop/hbase/io/hfile/TestCacheOnWrite.java
          • /hbase/branches/0.94/src/test/java/org/apache/hadoop/hbase/io/hfile/TestHFileDataBlockEncoder.java
          • /hbase/branches/0.94/src/test/java/org/apache/hadoop/hbase/io/hfile/TestLruBlockCache.java
          • /hbase/branches/0.94/src/test/java/org/apache/hadoop/hbase/regionserver/TestCacheOnWriteInSchema.java
          Hudson added a comment -

          Integrated in HBase-TRUNK-on-Hadoop-2.0.0 #258 (See https://builds.apache.org/job/HBase-TRUNK-on-Hadoop-2.0.0/258/)
          HBASE-5898 Consider double-checked locking for block cache lock (Todd, Elliot, LarsH) (Revision 1408620)

          Result = FAILURE
          larsh :
          Files :

          • /hbase/trunk/hbase-server/src/main/java/org/apache/hadoop/hbase/io/hfile/AbstractHFileReader.java
          • /hbase/trunk/hbase-server/src/main/java/org/apache/hadoop/hbase/io/hfile/BlockCache.java
          • /hbase/trunk/hbase-server/src/main/java/org/apache/hadoop/hbase/io/hfile/DoubleBlockCache.java
          • /hbase/trunk/hbase-server/src/main/java/org/apache/hadoop/hbase/io/hfile/HFileReaderV1.java
          • /hbase/trunk/hbase-server/src/main/java/org/apache/hadoop/hbase/io/hfile/HFileReaderV2.java
          • /hbase/trunk/hbase-server/src/main/java/org/apache/hadoop/hbase/io/hfile/LruBlockCache.java
          • /hbase/trunk/hbase-server/src/main/java/org/apache/hadoop/hbase/io/hfile/SimpleBlockCache.java
          • /hbase/trunk/hbase-server/src/main/java/org/apache/hadoop/hbase/io/hfile/slab/SingleSizeCache.java
          • /hbase/trunk/hbase-server/src/main/java/org/apache/hadoop/hbase/io/hfile/slab/SlabCache.java
          • /hbase/trunk/hbase-server/src/test/java/org/apache/hadoop/hbase/io/hfile/CacheTestUtils.java
          • /hbase/trunk/hbase-server/src/test/java/org/apache/hadoop/hbase/io/hfile/TestCacheOnWrite.java
          • /hbase/trunk/hbase-server/src/test/java/org/apache/hadoop/hbase/io/hfile/TestHFileDataBlockEncoder.java
          • /hbase/trunk/hbase-server/src/test/java/org/apache/hadoop/hbase/io/hfile/TestLruBlockCache.java
          • /hbase/trunk/hbase-server/src/test/java/org/apache/hadoop/hbase/regionserver/TestCacheOnWriteInSchema.java
          Lars Hofhansl made changes -
          Link This issue relates to HBASE-7160 [ HBASE-7160 ]
          Lars Hofhansl made changes -
          Status Resolved [ 5 ] Closed [ 6 ]
          Lars Hofhansl added a comment -

          One interesting behavior I saw was that HBase grinds to a halt when many threads scan along the same set of blocks and read short circuit is not enabled for the dfs client.
          This seems to be a problem in the dfs client (1.0.3 and 1.0.4).

I have no time right now to investigate, but I thought this might give a hint.

          Lars Hofhansl added a comment -

          I tracked it down to locking on istream in HFileBlock.readAtOffset.

          Here's my test case: 20m rows, single column family, single column, blockcache disabled for the scan, all data fits into the OS buffer cache.
          Full scan over all rows.

          One client scanning: 15s (regionserver keeps one core busy ~120%)
Two clients scanning along the same set of blocks: they both time out. (regionserver is at 5-6% CPU, clearly just waiting)

          Then I changed readAtOffset to always do preads. Now:
          One client scanning: 39s (regionserver at ~120%)
          Two clients scanning: 39s each (regionserver at ~210%)

So I'm not sure how to proceed. Unconditionally switching to pread obviously does not work (scan time almost tripled), while seek + read does not scale to multiple threads. I bet this is the issue folks have been seeing, and this patch does not address it.
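The contention described above comes from the stateful read path sharing a single stream cursor. The sketch below is an illustration only, not HBase code: `ReadContention` is a made-up class, and HBase's actual lock is on an `FSDataInputStream` inside `HFileBlock.readAtOffset`. It contrasts the two read styles: seek + read must serialize all callers behind one lock because they share a position, while positional pread carries its offset in the call and needs no lock.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class ReadContention {
    private final FileChannel channel;

    ReadContention(FileChannel channel) {
        this.channel = channel;
    }

    // Stateful seek + read: the shared stream position forces one global
    // lock, so concurrent scanners serialize on it. This is the shape of
    // the istream locking Lars observed.
    synchronized int seekAndRead(long offset, byte[] dst) throws IOException {
        channel.position(offset);
        return channel.read(ByteBuffer.wrap(dst));
    }

    // Positional read (pread): the offset is a call argument, no shared
    // cursor is mutated, so readers can proceed in parallel without a lock.
    int pread(long offset, byte[] dst) throws IOException {
        return channel.read(ByteBuffer.wrap(dst), offset);
    }

    public static void main(String[] args) throws IOException {
        Path p = Files.createTempFile("blocks", ".dat");
        Files.write(p, "0123456789".getBytes());
        try (FileChannel ch = FileChannel.open(p, StandardOpenOption.READ)) {
            ReadContention r = new ReadContention(ch);
            byte[] a = new byte[4];
            r.seekAndRead(2, a);
            System.out.println(new String(a));
            byte[] b = new byte[4];
            r.pread(5, b);
            System.out.println(new String(b));
        }
        Files.delete(p);
    }
}
```

The tradeoff in the numbers above follows directly: pread pays a per-call cost (hence the slower single-client scan), but never blocks a sibling reader.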

          Lars Hofhansl added a comment -

          Why is there only one reader/fsinput stream per store file?

          Lars Hofhansl added a comment -

          When read short circuiting is enabled:

          One client: 15s (120% CPU)
          Two clients: 41s each (160% CPU)

          with pread:
          One client: 18s (160% CPU)
          Two clients: 19s each (250% CPU)

          Presumably local read is fast enough in this case to make the synchronization less of a problem.

          Lars Hofhansl added a comment -

One last comment (I promise). Note that this is all with the block cache disabled.

          Shrijeet Paliwal added a comment -

          Hello Lars,

I do not mean to sidetrack you, but I raised http://markmail.org/thread/ulki4uccwb43ahdj ; what you are describing still seems to fall under #2 of my list. What do you think?

          Just wanted to link mailing list discussion here.

          Lars Hofhansl added a comment -

          Possibly. This will very much look like a deadlock. Scanners will time out, and it will look like the system has ground to a halt.
This will be exacerbated by large store files. Imagine a 20GB store file and only one reader at a time allowed on it.

I think we can have an ad hoc fix where we attempt seek + read and, if the istream is locked, switch to pread. That way we get seek + read when possible and pread when necessary.

          There also might be a mysterious deadlock, but until I see a jstack or am shown code that causes it I'll be skeptical.
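That ad hoc fix could be sketched as follows. This is a hypothetical illustration built around `ReentrantLock.tryLock()`, with made-up class and method names standing in for the real `HFileBlock.readAtOffset` internals; it is not the actual patch that was filed.

```java
import java.util.concurrent.locks.ReentrantLock;

public class AdaptiveReader {
    final ReentrantLock streamLock = new ReentrantLock();

    // Hypothetical stand-ins for the two read paths: stateful seek + read
    // on the shared stream vs lock-free positional pread.
    private String seekAndRead(long offset) { return "seek+read@" + offset; }
    private String pread(long offset) { return "pread@" + offset; }

    // Take the fast seek + read path only when the shared stream lock is
    // free; otherwise fall back to pread so concurrent scanners never
    // queue up behind one slow reader.
    String readAtOffset(long offset) {
        if (streamLock.tryLock()) {
            try {
                return seekAndRead(offset);
            } finally {
                streamLock.unlock();
            }
        }
        return pread(offset);
    }

    public static void main(String[] args) throws Exception {
        AdaptiveReader r = new AdaptiveReader();
        // Uncontended: the lock is free, so we get the stateful path.
        System.out.println(r.readAtOffset(128));
        // Contended: while this thread holds the stream lock, a second
        // thread's tryLock fails and it falls back to pread.
        r.streamLock.lock();
        try {
            Thread t = new Thread(() -> System.out.println(r.readAtOffset(256)));
            t.start();
            t.join();
        } finally {
            r.streamLock.unlock();
        }
    }
}
```

Note the contended case must be exercised from a second thread: `ReentrantLock` is reentrant, so `tryLock()` from the holding thread would succeed.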

          Shrijeet Paliwal added a comment -

Lars, I had put the stack trace here https://gist.github.com/4261746. It seems my issue is indeed a deadlock, but it's being triggered by a JVM bug http://bugs.sun.com/view_bug.do?bug_id=6822370 .

We recently revisited our GC opts and removed the UseMembar flag (it did not seem useful). I guess any HBase install running Java < 6u21 could potentially hit this deadlock. HBASE-3622 touched upon it.

          Thanks for your attention.

          Lars Hofhansl added a comment -

Sorry... Missed the jstack there. I can see that all handlers are waiting to get the lock to load a block, but not where the lock is held. If HDFS were slow that would happen; if there were a deadlock that would also happen. Since it appears that all handlers are blocked, I agree this could be a deadlock.

          Elliott Clark added a comment -

That's what we saw too (I can't remember the Java version). All the handlers were waiting for the lock and nothing was reading from HDFS. We had guessed either a CPU bug or a Java bug.

          stack added a comment -

Lars Hofhansl Can we make a new issue w/ your important finding? Need to fix. Do something like "...adhoc fix where attempt seek + read and if the istream is locked we'll switch to pread. That way we get seek + read when possible and pread when necessary." It's pretty critical, I'd say. Opening a pool of readers would be a PITA.

          Lars Hofhansl added a comment -

          Was just about to file a new bug. Will repost my findings there.

          Lars Hofhansl made changes -
          Link This issue is related to HBASE-7336 [ HBASE-7336 ]
          ramkrishna.s.vasudevan added a comment -

Agree with Elliot here. It was waiting for a lock, not reading from HDFS.

          Shrijeet Paliwal added a comment -

          After putting UseMembar in GC opts we have not seen the previously reported deadlock. Just wanted to update.

          Lars Hofhansl added a comment -

Thanks Shrijeet. How reliably have you seen this before (once a day, once a month, etc.)?

          If this really causes issues we should:

          1. ship with -XX:+UseMembar by default in hbase-env.sh
          2. document that this must be set

Are we confident enough in this to do that?
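Option (1) would be a one-line change to the shipped config. A sketch of what it might look like, assuming the usual hbase-env.sh convention of appending JVM flags to `HBASE_OPTS` (the flag itself is the one from this discussion and HBASE-3622):

```shell
# Hypothetical hbase-env.sh fragment: force the JVM to use a real memory
# barrier instead of the pseudo-barrier implicated in the 6822370 deadlock.
export HBASE_OPTS="$HBASE_OPTS -XX:+UseMembar"
```

Note -XX:+UseMembar only matters on JVMs where the pseudo-memory-barrier bug applies (reportedly Java < 6u21), which is an argument for option (2), documenting it, as well.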

          Shrijeet Paliwal added a comment -

          Lars, we saw this twice in two days (once each day, on different region servers).
It has been 36 hours since we put -XX:+UseMembar back and restarted the region servers; we have not seen it again yet.

          Stack documented this as part of HBASE-3622, http://hbase.apache.org/book/trouble.rs.html

          Hudson added a comment -

          Integrated in HBase-0.94-security-on-Hadoop-23 #10 (See https://builds.apache.org/job/HBase-0.94-security-on-Hadoop-23/10/)
          HBASE-5898 Consider double-checked locking for block cache lock (Todd, Elliot, LarsH) (Revision 1408621)

          Result = FAILURE
          larsh :
          Files :

          • /hbase/branches/0.94/src/main/java/org/apache/hadoop/hbase/io/hfile/AbstractHFileReader.java
          • /hbase/branches/0.94/src/main/java/org/apache/hadoop/hbase/io/hfile/BlockCache.java
          • /hbase/branches/0.94/src/main/java/org/apache/hadoop/hbase/io/hfile/DoubleBlockCache.java
          • /hbase/branches/0.94/src/main/java/org/apache/hadoop/hbase/io/hfile/HFileReaderV1.java
          • /hbase/branches/0.94/src/main/java/org/apache/hadoop/hbase/io/hfile/HFileReaderV2.java
          • /hbase/branches/0.94/src/main/java/org/apache/hadoop/hbase/io/hfile/LruBlockCache.java
          • /hbase/branches/0.94/src/main/java/org/apache/hadoop/hbase/io/hfile/SimpleBlockCache.java
          • /hbase/branches/0.94/src/main/java/org/apache/hadoop/hbase/io/hfile/slab/SingleSizeCache.java
          • /hbase/branches/0.94/src/main/java/org/apache/hadoop/hbase/io/hfile/slab/SlabCache.java
          • /hbase/branches/0.94/src/test/java/org/apache/hadoop/hbase/io/hfile/CacheTestUtils.java
          • /hbase/branches/0.94/src/test/java/org/apache/hadoop/hbase/io/hfile/TestCacheOnWrite.java
          • /hbase/branches/0.94/src/test/java/org/apache/hadoop/hbase/io/hfile/TestHFileDataBlockEncoder.java
          • /hbase/branches/0.94/src/test/java/org/apache/hadoop/hbase/io/hfile/TestLruBlockCache.java
          • /hbase/branches/0.94/src/test/java/org/apache/hadoop/hbase/regionserver/TestCacheOnWriteInSchema.java
          stack made changes -
          Fix Version/s 0.95.0 [ 12324094 ]
          Fix Version/s 0.96.0 [ 12320040 ]
          Fix Version/s 0.94.3 [ 12323144 ]
          Lars Hofhansl made changes -
          Fix Version/s 0.94.3 [ 12323144 ]

            People

            • Assignee:
              Todd Lipcon
              Reporter:
              Todd Lipcon
            • Votes:
              0 Vote for this issue
              Watchers:
              17 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development