Solr
  1. Solr
  2. SOLR-7585

ConcurrentLFUCache throws NoSuchElementException under a write-heavy load

    Details

    • Type: Bug Bug
    • Status: Closed
    • Priority: Minor Minor
    • Resolution: Fixed
    • Affects Version/s: 5.1
    • Fix Version/s: 5.2, 6.0
    • Component/s: None
    • Labels:
      None

      Description

      Under a write-heavy load ConcurrentLFUCache throws NoSuchElementException. The problem lies within org.apache.solr.util.ConcurrentLFUCache#put method, which allows for a race condition between the check and the call to markAndSweep method. Despite that a thread must acquire a lock to perform sweeping, it's still possible that multiple threads successfully detected a need for calling markAndSweep. If they execute it sequentially, subsequent runs will fail with NoSuchElementException.

      1. SOLR-7585.patch
        8 kB
        Shawn Heisey
      2. SOLR-7585.patch
        7 kB
        Shawn Heisey
      3. SOLR-7585.patch
        5 kB
        Shawn Heisey
      4. SOLR-7585.patch
        3 kB
        Maciej Zasada
      5. SOLR-7585.patch
        3 kB
        Maciej Zasada

        Activity

        Hide
        Maciej Zasada added a comment -

        Attached a patch containing a test case + a fix that mitigates the race condition. Cheers!

        Show
        Maciej Zasada added a comment - Attached a patch containing a test case + a fix that mitigates the race condition. Cheers!
        Hide
        Shawn Heisey added a comment -

        The LFU cache implementation is my code, but I didn't write it from scratch. I started with the existing LRU code. Now that I look at ConcurrentLRUCache, the markAndSweep method there seems to be quite a bit different, and I can't remember what I actually did or whether the differences came afterwards.

        I tried adding just the test so I could see if the test fails without the patch. I did so on branch_5x, which is how I discovered that the test uses a lambda expression – it won't compile on branch_5x because the build is targeted for a 1.7 compile version, and lamba expressions require 1.8. Since I haven't yet wrestled with lambda expressions, I have no idea what they actually do. Can you rewrite the test so it's compatible with Java 7?

        Show
        Shawn Heisey added a comment - The LFU cache implementation is my code, but I didn't write it from scratch. I started with the existing LRU code. Now that I look at ConcurrentLRUCache, the markAndSweep method there seems to be quite a bit different, and I can't remember what I actually did or whether the differences came afterwards. I tried adding just the test so I could see if the test fails without the patch. I did so on branch_5x, which is how I discovered that the test uses a lambda expression – it won't compile on branch_5x because the build is targeted for a 1.7 compile version, and lamba expressions require 1.8. Since I haven't yet wrestled with lambda expressions, I have no idea what they actually do. Can you rewrite the test so it's compatible with Java 7?
        Hide
        Maciej Zasada added a comment -

        Of course, latest patch contains same test compatible with Java 7. My bad, I tested against trunk (which is now on Java 8) instead of a 5.x branch. Cheers!

        Show
        Maciej Zasada added a comment - Of course, latest patch contains same test compatible with Java 7. My bad, I tested against trunk (which is now on Java 8) instead of a 5.x branch. Cheers!
        Hide
        Shawn Heisey added a comment -

        Alternate patch that checks the size of the TreeSet when deciding whether to simple add() or do a more complicated check involving remove/add.

        This feels like it fixes the underlying problem, rather than exiting the code early.

        The patch also gets rid of a bunch of warnings about parameterization of objects, and one warning about an unused variable. I chose to suppress the warning instead of removing the variable, because it's a variable that might find a use later.

        Show
        Shawn Heisey added a comment - Alternate patch that checks the size of the TreeSet when deciding whether to simple add() or do a more complicated check involving remove/add. This feels like it fixes the underlying problem, rather than exiting the code early. The patch also gets rid of a bunch of warnings about parameterization of objects, and one warning about an unused variable. I chose to suppress the warning instead of removing the variable, because it's a variable that might find a use later.
        Hide
        Maciej Zasada added a comment -

        Personally I like exiting the code early. If I understand correctly, new implementation will always evict at least one entry, which means cache size may drop below lowerWaterMark (which somehow brakes a contract). To fix underlying problem we should get rid of a race condition at all, but I haven't measure the performance impact of such change. Moreover, first approach has 2 additional benefits:

        • not producing additional garbage
        • not executing unnecessary code (if cache size is below upperWaterMark, there's no need to evict anything)

        both seem to be important for a cache. WDYT?

        Show
        Maciej Zasada added a comment - Personally I like exiting the code early. If I understand correctly, new implementation will always evict at least one entry, which means cache size may drop below lowerWaterMark (which somehow brakes a contract). To fix underlying problem we should get rid of a race condition at all, but I haven't measure the performance impact of such change. Moreover, first approach has 2 additional benefits: not producing additional garbage not executing unnecessary code (if cache size is below upperWaterMark , there's no need to evict anything) both seem to be important for a cache. WDYT?
        Hide
        Shawn Heisey added a comment -

        both seem to be important for a cache. WDYT?

        The benefits you noted certainly do sound like good things, but it looks like markAndSweep was designed to reduce the cache size to lowerWaterMark. Is it acceptable to avoid eviction if the current cache size is somewhere between the two watermarks? Will anything strange happen in a situation where a cache is defined with settings where any of the sizes/watermarks are not different numbers?

        FYI, the entire LFUCache implementation is due for replacement, because it's a very slow and naive implementation, and I figured out a better way to do it. See SOLR-3393.

        Show
        Shawn Heisey added a comment - both seem to be important for a cache. WDYT? The benefits you noted certainly do sound like good things, but it looks like markAndSweep was designed to reduce the cache size to lowerWaterMark. Is it acceptable to avoid eviction if the current cache size is somewhere between the two watermarks? Will anything strange happen in a situation where a cache is defined with settings where any of the sizes/watermarks are not different numbers? FYI, the entire LFUCache implementation is due for replacement, because it's a very slow and naive implementation, and I figured out a better way to do it. See SOLR-3393 .
        Hide
        Maciej Zasada added a comment -

        Yes, I saw SOLR-3393 before, but I assumed it will take a while to deprecate and remove the old class, and we're currently experiencing some exceptions because of this race condition Having in mind that you're working on more performant LFU cache, I tried to make changes as small and noninvasive as possible.

        it looks like markAndSweep was designed to reduce the cache size to lowerWaterMark.

        Exactly. I believe that your change will actually reduce (sometimes) the cache size to be equal to lowerWaterMark -1, instead of equal to lowerWaterMark.

        Is it acceptable to avoid eviction if the current cache size is somewhere between the two watermarks?

        Eviction will be only avoided if the size was less than or equal to upperWaterMark, which is exactly the same condition as in org.apache.solr.util.ConcurrentLFUCache#put method (L#139), where it's decided whether to call markAndSweep at all or not. Essentially, the same condition is checked after acquiring a lock to mitigate a race condition.

        Show
        Maciej Zasada added a comment - Yes, I saw SOLR-3393 before, but I assumed it will take a while to deprecate and remove the old class, and we're currently experiencing some exceptions because of this race condition Having in mind that you're working on more performant LFU cache, I tried to make changes as small and noninvasive as possible. it looks like markAndSweep was designed to reduce the cache size to lowerWaterMark. Exactly. I believe that your change will actually reduce (sometimes) the cache size to be equal to lowerWaterMark -1 , instead of equal to lowerWaterMark . Is it acceptable to avoid eviction if the current cache size is somewhere between the two watermarks? Eviction will be only avoided if the size was less than or equal to upperWaterMark , which is exactly the same condition as in org.apache.solr.util.ConcurrentLFUCache#put method (L#139), where it's decided whether to call markAndSweep at all or not. Essentially, the same condition is checked after acquiring a lock to mitigate a race condition.
        Hide
        Shawn Heisey added a comment -

        Now both checks are present. The method will abort if the cache is already below upperWaterMark, and the code further down will avoid executing the TreeSet#first() method when the set is empty, which is where the NoSuchElementException is actually thrown.

        I hope the comments are clear and correct.

        Show
        Shawn Heisey added a comment - Now both checks are present. The method will abort if the cache is already below upperWaterMark, and the code further down will avoid executing the TreeSet#first() method when the set is empty, which is where the NoSuchElementException is actually thrown. I hope the comments are clear and correct.
        Hide
        Maciej Zasada added a comment -

        Looks good to me, +1. Cheers!

        Show
        Maciej Zasada added a comment - Looks good to me, +1. Cheers!
        Hide
        Shawn Heisey added a comment -

        The test fails precommit.

        [forbidden-apis] Forbidden method invocation: java.util.concurrent.Executors#newFixedThreadPool(int) [Spawns threads with vague names; use a custom thread factory (Lucene's NamedThreadFactory, Solr's DefaultSolrThreadFactory) and name threads so that you can tell (by its name) which executor it is associated with]
        [forbidden-apis] in org.apache.solr.search.TestLFUCache (TestLFUCache.java:371)

        I'm looking at existing uses of DefaultSolrThreadFactory trying to figure out how to adapt it to this test code, but it's not making sense to me.

        Show
        Shawn Heisey added a comment - The test fails precommit. [forbidden-apis] Forbidden method invocation: java.util.concurrent.Executors#newFixedThreadPool(int) [Spawns threads with vague names; use a custom thread factory (Lucene's NamedThreadFactory, Solr's DefaultSolrThreadFactory) and name threads so that you can tell (by its name) which executor it is associated with] [forbidden-apis] in org.apache.solr.search.TestLFUCache (TestLFUCache.java:371) I'm looking at existing uses of DefaultSolrThreadFactory trying to figure out how to adapt it to this test code, but it's not making sense to me.
        Hide
        Shawn Heisey added a comment -

        I think i figured it out. If precommit passes now, I'll have a new patch momentarily.

        Show
        Shawn Heisey added a comment - I think i figured it out. If precommit passes now, I'll have a new patch momentarily.
        Hide
        Shawn Heisey added a comment -

        Apparently 6 threads is not enough to trigger the bug. When I ran the test (without the fix) on a system with three cores, the test passed. When I hard-coded the number of threads in the test to 8, then the test failed as expected.

        Maciej Zasada, should we hard-code the number of threads in the test to a number like 8 or 10?

        Show
        Shawn Heisey added a comment - Apparently 6 threads is not enough to trigger the bug. When I ran the test (without the fix) on a system with three cores, the test passed. When I hard-coded the number of threads in the test to 8, then the test failed as expected. Maciej Zasada , should we hard-code the number of threads in the test to a number like 8 or 10?
        Hide
        Shawn Heisey added a comment -

        New patch against trunk. Used DefaultSolrThreadFactory to fix the precommit. Also hard-coded the number of threads in the new test to 10 so that the unpatched ConcurrentLFUCache implementation fails the test even on systems with less than four CPUs.

        The CHANGES.txt assumes that the fix is going into version 5.2. I've asked the release manager for 5.2 whether this fix is OK to backport.

        Show
        Shawn Heisey added a comment - New patch against trunk. Used DefaultSolrThreadFactory to fix the precommit. Also hard-coded the number of threads in the new test to 10 so that the unpatched ConcurrentLFUCache implementation fails the test even on systems with less than four CPUs. The CHANGES.txt assumes that the fix is going into version 5.2. I've asked the release manager for 5.2 whether this fix is OK to backport.
        Hide
        Maciej Zasada added a comment -

        +1 for a fixed thread count and DefaultSolrThreadFactory. Cheers!

        Show
        Maciej Zasada added a comment - +1 for a fixed thread count and DefaultSolrThreadFactory. Cheers!
        Hide
        ASF subversion and git services added a comment -

        Commit 1681449 from Shawn Heisey in branch 'dev/trunk'
        [ https://svn.apache.org/r1681449 ]

        SOLR-7585: Fix NoSuchElementException in LFUCache

        Show
        ASF subversion and git services added a comment - Commit 1681449 from Shawn Heisey in branch 'dev/trunk' [ https://svn.apache.org/r1681449 ] SOLR-7585 : Fix NoSuchElementException in LFUCache
        Hide
        ASF subversion and git services added a comment -

        Commit 1681453 from Shawn Heisey in branch 'dev/branches/branch_5x'
        [ https://svn.apache.org/r1681453 ]

        SOLR-7585: Fix NoSuchElementException in LFUCache (merge trunk r1681449)

        Show
        ASF subversion and git services added a comment - Commit 1681453 from Shawn Heisey in branch 'dev/branches/branch_5x' [ https://svn.apache.org/r1681453 ] SOLR-7585 : Fix NoSuchElementException in LFUCache (merge trunk r1681449)
        Hide
        ASF subversion and git services added a comment -

        Commit 1681456 from Shawn Heisey in branch 'dev/branches/lucene_solr_5_2'
        [ https://svn.apache.org/r1681456 ]

        SOLR-7585: Fix NoSuchElementException in LFUCache (merge trunk r1681449)

        Show
        ASF subversion and git services added a comment - Commit 1681456 from Shawn Heisey in branch 'dev/branches/lucene_solr_5_2' [ https://svn.apache.org/r1681456 ] SOLR-7585 : Fix NoSuchElementException in LFUCache (merge trunk r1681449)
        Hide
        Shawn Heisey added a comment -

        Committed to trunk, branch_5x, and lucene_solr_5_2. Precommit passed for each. Also made sure that the new test failed without the fix and passed with the fix. Thanks for figuring this out, Maciej Zasada!

        Show
        Shawn Heisey added a comment - Committed to trunk, branch_5x, and lucene_solr_5_2. Precommit passed for each. Also made sure that the new test failed without the fix and passed with the fix. Thanks for figuring this out, Maciej Zasada !
        Hide
        Anshum Gupta added a comment -

        Bulk close for 5.2.0.

        Show
        Anshum Gupta added a comment - Bulk close for 5.2.0.

          People

          • Assignee:
            Shawn Heisey
            Reporter:
            Maciej Zasada
          • Votes:
            0 Vote for this issue
            Watchers:
            4 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development