Uploaded image for project: 'Solr'
  1. Solr
  2. SOLR-10121

BlockCache corruption with high concurrency

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 4.4
    • Fix Version/s: 6.5
    • Component/s: None
    • Security Level: Public (Default Security Level. Issues are Public)
    • Labels:
      None

      Description

      Improving the tests of the BlockCache in SOLR-10116 uncovered a corruption bug (either that or the test is flawed... TBD).

      The failing test is TestBlockCache.testBlockCacheConcurrent()

        Issue Links

          Activity

          Hide
          yseeley@gmail.com Yonik Seeley added a comment - - edited

          I reviewed the pertinent BlockCache and haven't seen any thread safety issues yet. Looking at the history of BlockCache, I reverted to right before SOLR-7355 was applied, and the issues went away. So it looks like it could be a thread safety or usage issue with Caffeine?
          Ben Manes, does putting a key/value in Caffeine constitute safe publication to a different thread (as is the case with ConcurrentHashMap for example)?

          Note that this doesn't necessarily mean something is wrong with Caffeine... it may be that the increased concurrency or other allowable differences in behavior uncover a bug in BlockCache as well.

          Show
          yseeley@gmail.com Yonik Seeley added a comment - - edited I reviewed the pertinent BlockCache and haven't seen any thread safety issues yet. Looking at the history of BlockCache, I reverted to right before SOLR-7355 was applied, and the issues went away. So it looks like it could be a thread safety or usage issue with Caffeine? Ben Manes , does putting a key/value in Caffeine constitute safe publication to a different thread (as is the case with ConcurrentHashMap for example)? Note that this doesn't necessarily mean something is wrong with Caffeine... it may be that the increased concurrency or other allowable differences in behavior uncover a bug in BlockCache as well.
          Hide
          ben.manes Ben Manes added a comment -

          Yes, a write should constitute a publication. Caffeine decorates a ConcurrentHashMap but does bypass it at times. By default eviction is asynchronous by delegating to fjp commonPool, but can be configured to use the caller instead. That might be useful for testing.

          Solr uses an old version of Caffeine. A patch was reviewed and approved, but needs someone to merge it in SOLR-8241. I'm not aware of a visibility bug in any release, but staying current would be helpful as I have fixed bugs since that version.

          Show
          ben.manes Ben Manes added a comment - Yes, a write should constitute a publication. Caffeine decorates a ConcurrentHashMap but does bypass it at times. By default eviction is asynchronous by delegating to fjp commonPool, but can be configured to use the caller instead. That might be useful for testing. Solr uses an old version of Caffeine. A patch was reviewed and approved, but needs someone to merge it in SOLR-8241 . I'm not aware of a visibility bug in any release, but staying current would be helpful as I have fixed bugs since that version.
          Hide
          yseeley@gmail.com Yonik Seeley added a comment -

          Hmmm, so on further review of BlockCache.java, I think I've found 2 concurrency issues.
          Unfortunately, fixing those issues does not get my test to pass.
          Another "issue" is that my test did pass pre-Caffeine, which means the test is not good enough at sussing out issues (since the BlockCache bugs I identified should not depend on the underlying map implementation).

          Show
          yseeley@gmail.com Yonik Seeley added a comment - Hmmm, so on further review of BlockCache.java, I think I've found 2 concurrency issues. Unfortunately, fixing those issues does not get my test to pass. Another "issue" is that my test did pass pre-Caffeine, which means the test is not good enough at sussing out issues (since the BlockCache bugs I identified should not depend on the underlying map implementation).
          Hide
          ben.manes Ben Manes added a comment -

          Can you try a local hack of changing Caffeine versions and, if it fails, try reverting back to CLHM? Both should be easy changes that could help us isolate it.

          Also note that CLHM ran the eviction listener on the same thread, whereas Caffeine delegates that to the executor. If there is a race due to that, you could use `executor(Runnable::run)` in the builder.

          Show
          ben.manes Ben Manes added a comment - Can you try a local hack of changing Caffeine versions and, if it fails, try reverting back to CLHM? Both should be easy changes that could help us isolate it. Also note that CLHM ran the eviction listener on the same thread, whereas Caffeine delegates that to the executor. If there is a race due to that, you could use `executor(Runnable::run)` in the builder.
          Hide
          yseeley@gmail.com Yonik Seeley added a comment -

          Thanks for the extra info - running the eviction listener in a separate thread shouldn't matter for correctness, but may work better the way this BlockCache code is written anyway.

          I went back and re-tested right before the Caffeine switch (SOLR-7355) and was able to reproduce some fails by bumping up the concurrency.

          Show
          yseeley@gmail.com Yonik Seeley added a comment - Thanks for the extra info - running the eviction listener in a separate thread shouldn't matter for correctness, but may work better the way this BlockCache code is written anyway. I went back and re-tested right before the Caffeine switch ( SOLR-7355 ) and was able to reproduce some fails by bumping up the concurrency.
          Hide
          yseeley@gmail.com Yonik Seeley added a comment -

          Here's a draft patch to fix the currently identified concurrency issues in BlockCache.

          • fetch() checked isRemoved() before the read from the block, but the block could be reused after that point (i.e. before or during the read), causing the wrong data to be returned.
          • store() allowed existing blocks to be updated, but resulted in corruption. The reason is that if one retrieves an existing block, one does not know when that block may stop being used for one key and start being used for another key. To safely do this, one would require write locks. Since we don't need the functionality, I simply failed the case of trying to cache/update a block already in the cache.
          Show
          yseeley@gmail.com Yonik Seeley added a comment - Here's a draft patch to fix the currently identified concurrency issues in BlockCache. fetch() checked isRemoved() before the read from the block, but the block could be reused after that point (i.e. before or during the read), causing the wrong data to be returned. store() allowed existing blocks to be updated, but resulted in corruption. The reason is that if one retrieves an existing block, one does not know when that block may stop being used for one key and start being used for another key. To safely do this, one would require write locks. Since we don't need the functionality, I simply failed the case of trying to cache/update a block already in the cache.
          Hide
          jira-bot ASF subversion and git services added a comment -

          Commit b71a667d74dfabeaad9584372bded80b0c609add in lucene-solr's branch refs/heads/master from Yonik Seeley
          [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=b71a667 ]

          SOLR-10121: fix race conditions in BlockCache.fetch and BlockCache.store

          Show
          jira-bot ASF subversion and git services added a comment - Commit b71a667d74dfabeaad9584372bded80b0c609add in lucene-solr's branch refs/heads/master from Yonik Seeley [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=b71a667 ] SOLR-10121 : fix race conditions in BlockCache.fetch and BlockCache.store
          Hide
          jira-bot ASF subversion and git services added a comment -

          Commit 65e2d2add68a557b1e628039c328f9346df282f9 in lucene-solr's branch refs/heads/branch_6x from Yonik Seeley
          [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=65e2d2a ]

          SOLR-10121: fix race conditions in BlockCache.fetch and BlockCache.store

          Show
          jira-bot ASF subversion and git services added a comment - Commit 65e2d2add68a557b1e628039c328f9346df282f9 in lucene-solr's branch refs/heads/branch_6x from Yonik Seeley [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=65e2d2a ] SOLR-10121 : fix race conditions in BlockCache.fetch and BlockCache.store
          Hide
          yseeley@gmail.com Yonik Seeley added a comment -

          I'm splitting off the Caffeine issues to SOLR-10141 since the BlockCache race conditions that have existed since inception and will need to be handled/backported separately.

          Show
          yseeley@gmail.com Yonik Seeley added a comment - I'm splitting off the Caffeine issues to SOLR-10141 since the BlockCache race conditions that have existed since inception and will need to be handled/backported separately.
          Hide
          jira-bot ASF subversion and git services added a comment -

          Commit cf1cba66f49c551cddbc6053565c30bf3a8b23bb in lucene-solr's branch refs/heads/master from Yonik Seeley
          [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=cf1cba6 ]

          SOLR-10121: enable BlockCacheTest.testBlockCacheConcurrent that now passes

          Show
          jira-bot ASF subversion and git services added a comment - Commit cf1cba66f49c551cddbc6053565c30bf3a8b23bb in lucene-solr's branch refs/heads/master from Yonik Seeley [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=cf1cba6 ] SOLR-10121 : enable BlockCacheTest.testBlockCacheConcurrent that now passes
          Hide
          jira-bot ASF subversion and git services added a comment -

          Commit 8dbb1bb3fb64fea4baa672ce82a1b62af22c3571 in lucene-solr's branch refs/heads/branch_6x from Yonik Seeley
          [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=8dbb1bb ]

          SOLR-10121: enable BlockCacheTest.testBlockCacheConcurrent that now passes

          Show
          jira-bot ASF subversion and git services added a comment - Commit 8dbb1bb3fb64fea4baa672ce82a1b62af22c3571 in lucene-solr's branch refs/heads/branch_6x from Yonik Seeley [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=8dbb1bb ] SOLR-10121 : enable BlockCacheTest.testBlockCacheConcurrent that now passes

            People

            • Assignee:
              yseeley@gmail.com Yonik Seeley
              Reporter:
              yseeley@gmail.com Yonik Seeley
            • Votes:
              0 Vote for this issue
              Watchers:
              3 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development