Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 0.92.0
    • Fix Version/s: 0.92.0
    • Component/s: None
    • Labels:
      None
    • Hadoop Flags:
      Reviewed

      Description

      A few races are still lingering in the slab cache. Here are some tests and proposed fixes.

      1. hbase-4330v7.txt
        21 kB
        Li Pi
      2. hbase-4330v6.txt
        20 kB
        Li Pi
      3. hbase-4330v5.txt
        19 kB
        Li Pi
      4. hbase-4330v4.txt
        19 kB
        Li Pi
      5. hbase-4330v3.txt
        1.30 MB
        Li Pi
      6. hbase-4330.txt
        18 kB
        Li Pi
      7. hbase-4330.txt
        13 kB
        Todd Lipcon

        Activity

        Todd Lipcon added a comment -

        Some new tests and some fixes, let me know what you think

        Ted Yu added a comment -

        When I ran TestSingleSizeCache, it seemed to hang.
        Here is part of jstack:
        http://pastebin.com/juL5ezzt

        Li Pi added a comment -

        Do we really need:
        contentBlock.serializedData.rewind();?
        As we always do a ByteBuffer.duplicate later.
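
For reference on the question above, here is a minimal Java sketch of how rewind() and duplicate() interact. duplicate() shares the underlying bytes but copies the source buffer's current position and limit, so it does not reset the position on its own. The class and method names are illustrative and are not taken from the patch; whether the HBase code path needs the rewind depends on what has read serializedData beforehand, which this sketch does not settle.

```java
import java.nio.ByteBuffer;

public class DuplicatePositionDemo {
    /** Returns the position a duplicate starts at after the source has been partly read. */
    public static int duplicatePositionAfterRead(boolean rewindFirst) {
        ByteBuffer source = ByteBuffer.allocate(8);
        source.putInt(42).putInt(7);
        source.flip();       // position = 0, limit = 8
        source.getInt();     // a reader advances the position to 4
        if (rewindFirst) {
            source.rewind(); // position back to 0, limit unchanged
        }
        // duplicate() shares content but inherits the current position and limit
        return source.duplicate().position();
    }

    public static void main(String[] args) {
        System.out.println(duplicatePositionAfterRead(false)); // prints 4
        System.out.println(duplicatePositionAfterRead(true));  // prints 0
    }
}
```

If nothing moves the source buffer's position before the duplicate is taken, the rewind is redundant; if anything does, the duplicate starts mid-buffer without it.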

        Li Pi added a comment -

        Also, what's the advantage of using a synchronized block and checking whether eviction has completed yet, versus using a read-write lock (RWL)?

        Li Pi added a comment -

        Spoke to Todd about these changes. +1. Some races remain, but I'm hunting them down.

        Todd Lipcon added a comment -

        Let's not commit until we get all of them figured out. No sense having 5 JIRAs all entitled "Fix races in slab cache".

        Li Pi added a comment -

        Okay

        @Ted Yu - I can't reproduce the hanging TestSingleSizeCache. Every once in a while mvn fails to run any tests after looping, but I've determined that's mvn being mvn.

        Li Pi added a comment -

        Ran 3 instances of the tests in a loop for more than 24 hours. No errors.

        However, mvn crashed many times, failing to run the tests at all. I never managed to get it to hang.

        Ted Yu added a comment -

        How long did TestSingleSizeCache take, on average?
        Thanks

        Li Pi added a comment -

        The final test, TestCacheMultiThreadedEvictions, took around 30 minutes. It's not hanging - it's just taking its sweet time. I'm investigating this right now.

        Li Pi added a comment -

        Fixed evictor resource starvation. Removed the spinlock.

        With enough threads, the spinlock was starving the eviction thread of CPU cycles. This caused the tests to run extremely slowly, giving the appearance of a hang.
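
One common alternative to spinning, sketched here under assumptions: the free buffers live in a BlockingQueue, so an allocating thread parks inside take() until the evictor hands a buffer back, instead of burning cycles that the evictor needs. The class BlockingSlabSketch and its methods are hypothetical and are not the patch's code.

```java
import java.nio.ByteBuffer;
import java.util.concurrent.LinkedBlockingQueue;

public class BlockingSlabSketch {
    // Hypothetical pool of free, fixed-size buffers.
    private final LinkedBlockingQueue<ByteBuffer> freeBuffers = new LinkedBlockingQueue<>();

    public BlockingSlabSketch(int blocks, int blockSize) {
        for (int i = 0; i < blocks; i++) {
            freeBuffers.add(ByteBuffer.allocate(blockSize));
        }
    }

    /** Parks the caller until a buffer is free; no busy-waiting. */
    public ByteBuffer alloc() throws InterruptedException {
        return freeBuffers.take();
    }

    /** Called by the evictor to return a buffer to the pool. */
    public void release(ByteBuffer buffer) {
        buffer.clear();
        freeBuffers.add(buffer);
    }

    /** Exhausts a one-block slab, then waits for an "evictor" thread to free the block. */
    public static boolean demo() {
        BlockingSlabSketch slab = new BlockingSlabSketch(1, 64);
        try {
            ByteBuffer first = slab.alloc(); // the slab is now empty
            Thread evictor = new Thread(() -> {
                try {
                    Thread.sleep(50);
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
                slab.release(first); // eviction hands the buffer back
            });
            evictor.start();
            ByteBuffer second = slab.alloc(); // blocks until the evictor releases
            evictor.join();
            return second == first;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

A side effect of this design is that alloc() now declares InterruptedException, which every caller must handle or propagate.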

        Ted Yu added a comment -

        Please fix the following:

        [INFO] Compilation failure
        
        /home/hadoop/hbase/src/test/java/org/apache/hadoop/hbase/io/hfile/slab/TestSlab.java:[47,33] unreported exception java.lang.InterruptedException; must be caught or declared to be thrown
        
        /home/hadoop/hbase/src/test/java/org/apache/hadoop/hbase/io/hfile/slab/TestSlab.java:[67,33] unreported exception java.lang.InterruptedException; must be caught or declared to be thrown
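
For context, javac reports "unreported exception ... must be caught or declared to be thrown" when code calls a method that declares a checked exception without handling it. A minimal sketch of the two usual remedies follows, using BlockingQueue.take() as a stand-in for whichever call in TestSlab now throws InterruptedException; the class and method names are illustrative only.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

public class InterruptedHandlingSketch {
    /** Option 1: declare the exception; the usual choice for test methods. */
    public static int takeDeclaring(BlockingQueue<Integer> queue) throws InterruptedException {
        return queue.take();
    }

    /** Option 2: catch it, restore the interrupt flag, and fail loudly. */
    public static int takeCatching(BlockingQueue<Integer> queue) {
        try {
            return queue.take();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // keep the interrupt visible to callers
            throw new IllegalStateException("interrupted while waiting", e);
        }
    }

    /** Small driver that needs no throws clause because it uses option 2. */
    public static int demo() {
        BlockingQueue<Integer> queue = new ArrayBlockingQueue<>(1);
        queue.add(7);
        return takeCatching(queue);
    }

    public static void main(String[] args) throws InterruptedException {
        BlockingQueue<Integer> queue = new ArrayBlockingQueue<>(1);
        queue.add(1);
        System.out.println(takeDeclaring(queue));
        System.out.println(demo());
    }
}
```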
        
        Li Pi added a comment -

        Done.

        Ted Yu added a comment -

        Patch v3 contains way too many changes.
        Can you rebase and produce a cleaner patch?

        Thanks

        Li Pi added a comment -

        Woah, not sure what happened there. Fixing.

        Li Pi added a comment -

        Rebased.

        Ted Yu added a comment -
        Running org.apache.hadoop.hbase.io.hfile.slab.TestSlabCache
        
        Results :
        
        Tests run: 0, Failures: 0, Errors: 0, Skipped: 0
        
        [INFO] ------------------------------------------------------------------------
        [ERROR] BUILD ERROR
        [INFO] ------------------------------------------------------------------------
        [INFO] Failure or timeout
        [INFO] ------------------------------------------------------------------------
        [INFO] For more information, run Maven with the -e switch
        [INFO] ------------------------------------------------------------------------
        [INFO] Total time: 15 minutes 5 seconds
        

        Here is the jstack: http://pastebin.com/vDCBMyrq
        Here is the OS:

        Linux us01.ciq.com 2.6.18-194.32.1.el5 #1 SMP Wed Jan 5 17:52:25 EST 2011 x86_64 x86_64 x86_64 GNU/Linux
        
        Li Pi added a comment -

        The patch now shuts down the scheduledThreadPool.

        Li Pi added a comment -

        This should fix the problem reported in your stack trace. A non-daemon thread was never terminated, so the test never completed.
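
A minimal sketch of the general pattern, assuming the cache owns a scheduled executor for periodic work such as statistics logging. Threads created by the default Executors factories are non-daemon, so the JVM (and a forked test run) stays alive until the pool is shut down. The class and field names below are hypothetical, not the patch's.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

public class SchedulerShutdownSketch {
    // Default thread factory: worker threads are non-daemon.
    private final ScheduledExecutorService statsPool = Executors.newScheduledThreadPool(1);

    public void start() {
        // Placeholder periodic task standing in for stats logging.
        statsPool.scheduleAtFixedRate(() -> { }, 1, 1, TimeUnit.SECONDS);
    }

    /** Without this call the worker thread keeps the JVM from exiting. */
    public void shutdown() {
        statsPool.shutdown();
    }

    /** Returns true if the pool terminates promptly after shutdown(). */
    public static boolean demo() {
        SchedulerShutdownSketch cache = new SchedulerShutdownSketch();
        cache.start();
        cache.shutdown();
        try {
            return cache.statsPool.awaitTermination(1, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

An alternative is to supply a ThreadFactory that marks the worker threads as daemon, which avoids the hang but skips an orderly shutdown.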

        Ted Yu added a comment -

        We're getting close.
        The three tests passed on Linux.
        But on MacBook:

        testCacheMultiThreadedEviction(org.apache.hadoop.hbase.io.hfile.slab.TestSlabCache)  Time elapsed: 44.652 sec  <<< ERROR!
        java.lang.RuntimeException: Deferred
                at org.apache.hadoop.hbase.MultithreadedTestUtil$TestContext.checkException(MultithreadedTestUtil.java:76)
                at org.apache.hadoop.hbase.MultithreadedTestUtil$TestContext.stop(MultithreadedTestUtil.java:97)
                at org.apache.hadoop.hbase.io.hfile.CacheTestUtils.hammerEviction(CacheTestUtils.java:208)
                at org.apache.hadoop.hbase.io.hfile.slab.TestSlabCache.testCacheMultiThreadedEviction(TestSlabCache.java:87)
        ...
        Caused by: java.lang.RuntimeException: already cached key_2_3
                at org.apache.hadoop.hbase.io.hfile.slab.SingleSizeCache.cacheBlock(SingleSizeCache.java:132)
                at org.apache.hadoop.hbase.io.hfile.slab.SlabCache.cacheBlock(SlabCache.java:207)
                at org.apache.hadoop.hbase.io.hfile.CacheTestUtils$3.doAnAction(CacheTestUtils.java:194)
                at org.apache.hadoop.hbase.MultithreadedTestUtil$RepeatingTestThread.doWork(MultithreadedTestUtil.java:139)
                at org.apache.hadoop.hbase.MultithreadedTestUtil$TestThread.run(MultithreadedTestUtil.java:115)
        

        Here is OS info:

        Darwin tyumac.local 10.8.0 Darwin Kernel Version 10.8.0: Tue Jun  7 16:33:36 PDT 2011; root:xnu-1504.15.3~1/RELEASE_I386 i386 i386
        
        Li Pi added a comment -

        Fixed race condition leading to the test failure.

        Ted Yu added a comment -

        I don't see much difference for patch v6 on my MacBook:

        testCacheMultiThreadedEviction(org.apache.hadoop.hbase.io.hfile.slab.TestSlabCache)  Time elapsed: 23.649 sec  <<< ERROR!
        java.lang.RuntimeException: Deferred
                at org.apache.hadoop.hbase.MultithreadedTestUtil$TestContext.checkException(MultithreadedTestUtil.java:76)
                at org.apache.hadoop.hbase.MultithreadedTestUtil$TestContext.stop(MultithreadedTestUtil.java:97)
                at org.apache.hadoop.hbase.io.hfile.CacheTestUtils.hammerEviction(CacheTestUtils.java:211)
                at org.apache.hadoop.hbase.io.hfile.slab.TestSlabCache.testCacheMultiThreadedEviction(TestSlabCache.java:87)
        ...
        Caused by: java.lang.RuntimeException: already cached key_8_9
                at org.apache.hadoop.hbase.io.hfile.slab.SingleSizeCache.cacheBlock(SingleSizeCache.java:132)
                at org.apache.hadoop.hbase.io.hfile.slab.SlabCache.cacheBlock(SlabCache.java:207)
                at org.apache.hadoop.hbase.io.hfile.CacheTestUtils$3.doAnAction(CacheTestUtils.java:197)
                at org.apache.hadoop.hbase.MultithreadedTestUtil$RepeatingTestThread.doWork(MultithreadedTestUtil.java:139)
                at org.apache.hadoop.hbase.MultithreadedTestUtil$TestThread.run(MultithreadedTestUtil.java:115)
        
        Li Pi added a comment -

        Yeah. This doesn't take care of it completely. I'll figure something out. It does take care of one case though.
        On Sep 9, 2011 9:49 PM, "Ted Yu (JIRA)" <jira@apache.org> wrote:

        Li Pi added a comment -

        I'm looping mvn on the latest patch - I don't see any failures yet.

        Li Pi added a comment -

        The change in v7 that fixes things:

        - scache.cacheBlock(blockName, cachedItem); // if this
        - // fails, due to
        - // block already
        - // being there, exception will be thrown
        - backingStore.put(blockName, scache);
        +
        + /* This will throw a runtime exception if we try to cache the same value twice */
        + scache.cacheBlock(blockName, cachedItem);
        +
        + /* Spinlock, if we're spinlocking, that means an eviction hasn't taken place yet */
        + while (backingStore.putIfAbsent(blockName, scache) != null) {
        +   Thread.yield();
        + }

        The test failed when the following occurred.

        Invariant: SingleSizeCache (SSC) and SlabCache (SC) have the same contents.

        Violation:

        Item A is in both SSC and SC.

        Thread A: evicts A from SSC.
        Thread B: starts doing a put of A into SC.
        Thread B: gets directed into SSC and starts doing the put in SSC.
        Thread B: the put goes through thanks to the eviction in progress.
        Thread A: calls the evictor on SC, removing the object from SlabCache.

        Result: the object is in SSC, but not in SC.
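
The putIfAbsent loop from the v7 change can be illustrated in isolation. This is a sketch under assumptions, not the SlabCache code: a plain ConcurrentHashMap stands in for backingStore, strings stand in for the cache references, and a helper thread plays the evictor. ConcurrentHashMap.putIfAbsent returns the existing value when the key is already mapped and null when the new mapping was installed, so the writer yields until the stale entry has been removed and never has its own entry wiped by a late eviction of the old one.

```java
import java.util.concurrent.ConcurrentHashMap;

public class PutIfAbsentSketch {
    /** Publishes key -> owner only once no older mapping for key remains. */
    public static void publish(ConcurrentHashMap<String, String> store, String key, String owner) {
        while (store.putIfAbsent(key, owner) != null) {
            Thread.yield(); // an eviction of the old entry is still pending
        }
    }

    public static String demo() {
        ConcurrentHashMap<String, String> backingStore = new ConcurrentHashMap<>();
        backingStore.put("key_2_3", "old");

        // Stand-in evictor: removes the stale mapping a little later.
        Thread evictor = new Thread(() -> {
            try {
                Thread.sleep(50);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
            backingStore.remove("key_2_3", "old"); // conditional remove: only the stale value
        });
        evictor.start();

        publish(backingStore, "key_2_3", "new"); // waits for the eviction to finish
        try {
            evictor.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return backingStore.get("key_2_3");
    }

    public static void main(String[] args) {
        System.out.println(demo()); // prints new
    }
}
```

With an unconditional put, as in the removed lines, the writer's mapping could be installed before the evictor's removal runs and then be deleted by it, which is the "in SSC, but not in SC" outcome described above.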

        Ted Yu added a comment -

        +1 on patch v7.

        Ted Yu added a comment -

        So far the slab unit tests have passed on both MacBook and Linux.
        Integrated to TRUNK.

        Thanks for the patch, Li and Todd.

        Hudson added a comment -

        Integrated in HBase-TRUNK #2211 (See https://builds.apache.org/job/HBase-TRUNK/2211/)
        HBASE-4330 Fix races in slab cache (Li Pi & Todd)

        tedyu :
        Files :

        • /hbase/trunk/CHANGES.txt
        • /hbase/trunk/src/main/java/org/apache/hadoop/hbase/io/hfile/HFileBlock.java
        • /hbase/trunk/src/main/java/org/apache/hadoop/hbase/io/hfile/slab/SingleSizeCache.java
        • /hbase/trunk/src/main/java/org/apache/hadoop/hbase/io/hfile/slab/Slab.java
        • /hbase/trunk/src/main/java/org/apache/hadoop/hbase/io/hfile/slab/SlabCache.java
        • /hbase/trunk/src/test/java/org/apache/hadoop/hbase/io/hfile/CacheTestUtils.java
        • /hbase/trunk/src/test/java/org/apache/hadoop/hbase/io/hfile/slab/TestSingleSizeCache.java
        • /hbase/trunk/src/test/java/org/apache/hadoop/hbase/io/hfile/slab/TestSlab.java
        • /hbase/trunk/src/test/java/org/apache/hadoop/hbase/io/hfile/slab/TestSlabCache.java

          People

          • Assignee:
            Li Pi
            Reporter:
            Todd Lipcon