Uploaded image for project: 'Solr'
  1. Solr
  2. SOLR-9524

SolrIndexSearcher.getIndexFingerprint uses dubious sunchronization

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 6.3
    • Fix Version/s: 6.3, 7.0
    • Component/s: None
    • Security Level: Public (Default Security Level. Issues are Public)
    • Labels:
      None

      Description

      In SOLR-9310 we added more code that does some fingerprint caching in SolrIndexSearcher. However, the synchronization looks like it could be made more efficient and may have issues with correctness.

      https://github.com/apache/lucene-solr/blob/branch_6x/solr/core/src/java/org/apache/solr/search/SolrIndexSearcher.java#L2371-L2385

      Some of the issues:

      • Double checked locking needs use of volatile variables to ensure proper memory semantics.
      • sync on a ConcurrentHashMap is usually a code smell

        Issue Links

          Activity

          Hide
          mdrob Mike Drob added a comment -

          Might not actually be DCL because we do reload the value. I still think this method can be cleaned up though.

          Show
          mdrob Mike Drob added a comment - Might not actually be DCL because we do reload the value. I still think this method can be cleaned up though.
          Hide
          yseeley@gmail.com Yonik Seeley added a comment -

          It wouldn't be a DCL anti-pattern (i.e. not thread safe) in either case because the underlying cache is thread safe.

          Show
          yseeley@gmail.com Yonik Seeley added a comment - It wouldn't be a DCL anti-pattern (i.e. not thread safe) in either case because the underlying cache is thread safe.
          Hide
          mbraun688 Michael Braun added a comment -

          maxVersionFingerprintCache.computeIfAbsent seems a good replacement for this functionality but there is a comment about the operation being possibly expensive such that one may want to actually block on multiple threads calculating the same value, which computeIfAbsent will not do.

          Show
          mbraun688 Michael Braun added a comment - maxVersionFingerprintCache.computeIfAbsent seems a good replacement for this functionality but there is a comment about the operation being possibly expensive such that one may want to actually block on multiple threads calculating the same value, which computeIfAbsent will not do.
          Hide
          mbraun688 Michael Braun added a comment -

          The existing code actually seems to have a bug in a different way:

                fingerprint = maxVersionFingerprintCache.get(maxVersionFingerprintCache);
          

          It's looking its own reference up instead of the maxVersion long, which it checks earlier (and puts as).

          Show
          mbraun688 Michael Braun added a comment - The existing code actually seems to have a bug in a different way: fingerprint = maxVersionFingerprintCache.get(maxVersionFingerprintCache); It's looking its own reference up instead of the maxVersion long, which it checks earlier (and puts as).
          Hide
          yseeley@gmail.com Yonik Seeley added a comment -

          Ha - nice catch!

          Show
          yseeley@gmail.com Yonik Seeley added a comment - Ha - nice catch!
          Hide
          yseeley@gmail.com Yonik Seeley added a comment -

          Since SOLR-9310 has already been released (5.5.3?), we can't re-open it to fix. Might as well just use this JIRA?

          Show
          yseeley@gmail.com Yonik Seeley added a comment - Since SOLR-9310 has already been released (5.5.3?), we can't re-open it to fix. Might as well just use this JIRA?
          Hide
          mdrob Mike Drob added a comment -

          Chatted with Michael Braun about this a bit...

          I think computeIfAbsent is close to what we want, but it doesn't let us easily preserve the current behaviour to throw IOException - we'd have to wrap it as an RTE. I'm also a little sceptical of the claim that multiple threads would be attempting to calculate the same fingerprint, since this is only called when we open a new searcher.

          I'll agree that it may be expensive, but I don't what code paths we are trying to protect against. And I think we'd get much better gains by doing the optimizations at the lower level... There's a note in IndexFingerprint that // TODO: this could be parallelized, or even cached per-segment if performance becomes an issue, which I think is where we should look instead of trying to ensure that a fingerprint is only calculated once.

          If we do need to ensure that each fingerprint is only calculated once for performance issues, then we should switch to a more granular locking mechanism. Possibly something like striped locking, or a parallel Map<Long, Lock>. This is a lot more complexity.

          Show
          mdrob Mike Drob added a comment - Chatted with Michael Braun about this a bit... I think computeIfAbsent is close to what we want, but it doesn't let us easily preserve the current behaviour to throw IOException - we'd have to wrap it as an RTE. I'm also a little sceptical of the claim that multiple threads would be attempting to calculate the same fingerprint, since this is only called when we open a new searcher. I'll agree that it may be expensive, but I don't what code paths we are trying to protect against. And I think we'd get much better gains by doing the optimizations at the lower level... There's a note in IndexFingerprint that // TODO: this could be parallelized, or even cached per-segment if performance becomes an issue , which I think is where we should look instead of trying to ensure that a fingerprint is only calculated once. If we do need to ensure that each fingerprint is only calculated once for performance issues, then we should switch to a more granular locking mechanism. Possibly something like striped locking, or a parallel Map<Long, Lock>. This is a lot more complexity.
          Hide
          mbraun688 Michael Braun added a comment -

          Mike Drob it's not as pretty as computeifAbsent but it probably make sense from an exception point of view to keep it as is and possibly change the synchronization.

          Show
          mbraun688 Michael Braun added a comment - Mike Drob it's not as pretty as computeifAbsent but it probably make sense from an exception point of view to keep it as is and possibly change the synchronization.
          Hide
          mdrob Mike Drob added a comment -

          According to ConcurrentHashMap.computeIfAbsent javadoc though, "the entire method invocation is performed atomically, so the function is applied at most once per key."

          So if we're willing to live with the clunky exception handling it requires, then I'm back on board with that approach.

          Show
          mdrob Mike Drob added a comment - According to ConcurrentHashMap.computeIfAbsent javadoc though, "the entire method invocation is performed atomically, so the function is applied at most once per key." So if we're willing to live with the clunky exception handling it requires, then I'm back on board with that approach.
          Hide
          mdrob Mike Drob added a comment -

          And looking at the Oracle JDK 8 implementation, I'm reasonably convinced that this is the correct way to go.

          Show
          mdrob Mike Drob added a comment - And looking at the Oracle JDK 8 implementation, I'm reasonably convinced that this is the correct way to go.
          Hide
          mbraun688 Michael Braun added a comment -

          Whoops sorry Mike Drob I was looking at the wrong function! Ignore my previous comment re: not preferring computeIfAbsent.

          Show
          mbraun688 Michael Braun added a comment - Whoops sorry Mike Drob I was looking at the wrong function! Ignore my previous comment re: not preferring computeIfAbsent.
          Hide
          mdrob Mike Drob added a comment -

          Attaching a patch for master branch.

          Yonik Seeley, Noble Paul - you've been in this code most recently, what do you think?

          Show
          mdrob Mike Drob added a comment - Attaching a patch for master branch. Yonik Seeley , Noble Paul - you've been in this code most recently, what do you think?
          Hide
          yseeley@gmail.com Yonik Seeley added a comment -

          I'm also a little sceptical of the claim that multiple threads would be attempting to calculate the same fingerprint

          When a new leader is elected, I think everyone peersyncs to that leader.
          When I originally wrote the code it was just a single fingerprint value (no max rolled in), and so it made sense to cache (since the common case is multiple replicas asking for the same info around the same time). After the code was modified to use maxversion, it's less clear what the overlap will be, but if everyone is already in sync, it should still be high.

          > There's a note in IndexFingerprint that // TODO: this could be parallelized, or even cached per-segment if performance becomes an issue, which I think is where we should look instead of trying to ensure that a fingerprint is only calculated once.

          I don't see those as exclusive... caching a top-level fingerprint is desirable until we have per-segment caching.

          Show
          yseeley@gmail.com Yonik Seeley added a comment - I'm also a little sceptical of the claim that multiple threads would be attempting to calculate the same fingerprint When a new leader is elected, I think everyone peersyncs to that leader. When I originally wrote the code it was just a single fingerprint value (no max rolled in), and so it made sense to cache (since the common case is multiple replicas asking for the same info around the same time). After the code was modified to use maxversion, it's less clear what the overlap will be, but if everyone is already in sync, it should still be high. > There's a note in IndexFingerprint that // TODO: this could be parallelized, or even cached per-segment if performance becomes an issue, which I think is where we should look instead of trying to ensure that a fingerprint is only calculated once. I don't see those as exclusive... caching a top-level fingerprint is desirable until we have per-segment caching.
          Hide
          mdrob Mike Drob added a comment -

          Yea, that makes sense. I'm much less worried about correctness and efficiency when we can delegate to the JDK and trust them to do it correctly.

          Show
          mdrob Mike Drob added a comment - Yea, that makes sense. I'm much less worried about correctness and efficiency when we can delegate to the JDK and trust them to do it correctly.
          Hide
          dsmiley David Smiley added a comment -

          RE computeIfAbsent and IOException: I recent reviewed some code and gave feedback on an issue pertaining to this (I can't seen to find right it now). The code on whatever issue it was could have theoretically used computeIfAbsent and would thus have been much nicer were it not for the IOException. So we couldn't use it; what a pain, sigh.... If we want to use it here too (were it not for IOException) we might want to add a utility to make this easy so that the semantics of what we're doing is clear. computeIfAbsent is beautiful when you can use it.

          Show
          dsmiley David Smiley added a comment - RE computeIfAbsent and IOException: I recent reviewed some code and gave feedback on an issue pertaining to this (I can't seen to find right it now). The code on whatever issue it was could have theoretically used computeIfAbsent and would thus have been much nicer were it not for the IOException. So we couldn't use it; what a pain, sigh... . If we want to use it here too (were it not for IOException) we might want to add a utility to make this easy so that the semantics of what we're doing is clear. computeIfAbsent is beautiful when you can use it.
          Hide
          praste Pushkar Raste added a comment -

          The reason I didn't use computeIfAbsent is computing fingerprint could throw an exception. Code to account for exception looks pretty ugly.

          It also seems like some changes to the original patch, I dont remember using ConcurrentHashmap.

          I will double check though.

          Show
          praste Pushkar Raste added a comment - The reason I didn't use computeIfAbsent is computing fingerprint could throw an exception. Code to account for exception looks pretty ugly. It also seems like some changes to the original patch, I dont remember using ConcurrentHashmap. I will double check though.
          Hide
          praste Pushkar Raste added a comment -

          PS : We are planning to cache fingerprint per segment under Solr-9506

          Show
          praste Pushkar Raste added a comment - PS : We are planning to cache fingerprint per segment under Solr-9506
          Hide
          praste Pushkar Raste added a comment -

          May be we should hold off 6.2.1 release

          Show
          praste Pushkar Raste added a comment - May be we should hold off 6.2.1 release
          Hide
          praste Pushkar Raste added a comment -

          Nice catch.

          May be we should right a test case, now that twice fingerprint caching got us in trouble.

          Show
          praste Pushkar Raste added a comment - Nice catch. May be we should right a test case, now that twice fingerprint caching got us in trouble.
          Hide
          noble.paul Noble Paul added a comment -

          fingerprint = maxVersionFingerprintCache.get(maxVersionFingerprintCache);

          LOL, nice bug.

          May be we should hold off 6.2.1 release

          Pushkar Raste I don't think it's a big deal. Essentially it's just made caching useless. We didn't have any caching till now. So, it is not going to be any worse than what it used to be

          Show
          noble.paul Noble Paul added a comment - fingerprint = maxVersionFingerprintCache.get(maxVersionFingerprintCache); LOL, nice bug. May be we should hold off 6.2.1 release Pushkar Raste I don't think it's a big deal. Essentially it's just made caching useless. We didn't have any caching till now. So, it is not going to be any worse than what it used to be
          Hide
          yseeley@gmail.com Yonik Seeley added a comment -

          We didn't have any caching till now. So, it is not going to be any worse than what it used to be

          It was cached before SOLR-9310, so if everyone was in sync (say coming back up from a reboot, or after electing a new leader), it could certainly be more expensive than it was.

          Show
          yseeley@gmail.com Yonik Seeley added a comment - We didn't have any caching till now. So, it is not going to be any worse than what it used to be It was cached before SOLR-9310 , so if everyone was in sync (say coming back up from a reboot, or after electing a new leader), it could certainly be more expensive than it was.
          Hide
          mdrob Mike Drob added a comment -

          RE computeIfAbsent and IOException: I recent reviewed some code and gave feedback on an issue pertaining to this (I can't seen to find right it now). The code on whatever issue it was could have theoretically used computeIfAbsent and would thus have been much nicer were it not for the IOException. So we couldn't use it; what a pain, sigh.... If we want to use it here too (were it not for IOException) we might want to add a utility to make this easy so that the semantics of what we're doing is clear. computeIfAbsent is beautiful when you can use it.

          I've been poking at this, but due to some quirks of java generics, it's harder to put in a generic utility than it would appear at first blush. Going down this route, we may have to limit ourselves strictly to IOException.

          I don't think it's a big deal. Essentially it's just made caching useless. We didn't have any caching till now. So, it is not going to be any worse than what it used to be

          It only made the second check useless. The first check and the put are still correct, so we will still have some amount of caching right now, but multiple threads asking at once could end up all doing the work.

          Show
          mdrob Mike Drob added a comment - RE computeIfAbsent and IOException: I recent reviewed some code and gave feedback on an issue pertaining to this (I can't seen to find right it now). The code on whatever issue it was could have theoretically used computeIfAbsent and would thus have been much nicer were it not for the IOException. So we couldn't use it; what a pain, sigh.... If we want to use it here too (were it not for IOException) we might want to add a utility to make this easy so that the semantics of what we're doing is clear. computeIfAbsent is beautiful when you can use it. I've been poking at this, but due to some quirks of java generics, it's harder to put in a generic utility than it would appear at first blush. Going down this route, we may have to limit ourselves strictly to IOException. I don't think it's a big deal. Essentially it's just made caching useless. We didn't have any caching till now. So, it is not going to be any worse than what it used to be It only made the second check useless. The first check and the put are still correct, so we will still have some amount of caching right now, but multiple threads asking at once could end up all doing the work.
          Hide
          praste Pushkar Raste added a comment -

          I will look through logs of our collections (fortunately we have plenty of those) and see how expensive is fingerprint computation. There is already a ticket @noble opened for caching fingerprint per segment. I am going to get to it next week.

          Show
          praste Pushkar Raste added a comment - I will look through logs of our collections (fortunately we have plenty of those) and see how expensive is fingerprint computation. There is already a ticket @noble opened for caching fingerprint per segment. I am going to get to it next week.
          Hide
          mdrob Mike Drob added a comment -

          I will look through logs of our collections (fortunately we have plenty of those) and see how expensive is fingerprint computation.

          Pushkar Raste - did you get a chance to verify this? I tried to check on our systems, but the first few deployments I looked at were all on versions before index fingerprinting.

          Show
          mdrob Mike Drob added a comment - I will look through logs of our collections (fortunately we have plenty of those) and see how expensive is fingerprint computation. Pushkar Raste - did you get a chance to verify this? I tried to check on our systems, but the first few deployments I looked at were all on versions before index fingerprinting.
          Hide
          praste Pushkar Raste added a comment -

          I checked logs on one of our biggest collection with following stats
          num docs: 1591412892
          shards: 12 spread across two solr instances (each process hosts 6 shards and each instance runs on separate physical box).
          auto soft commit : 1 secs (not sure this matters as computing fingerprint opens a new NR searcher anyway)
          There is replication too but I don't think that matters a lot here.

          It is typically taking ~2 seconds on avg.

          This definitely requires caching.

          I am planning to take on SOLR-9506 this week.

          Show
          praste Pushkar Raste added a comment - I checked logs on one of our biggest collection with following stats num docs: 1591412892 shards: 12 spread across two solr instances (each process hosts 6 shards and each instance runs on separate physical box). auto soft commit : 1 secs (not sure this matters as computing fingerprint opens a new NR searcher anyway) There is replication too but I don't think that matters a lot here. It is typically taking ~2 seconds on avg. This definitely requires caching. I am planning to take on SOLR-9506 this week.
          Hide
          mdrob Mike Drob added a comment -

          Thanks for looking. I agree that SOLR-9506 would be a good improvement too.

          The reason I didn't use computeIfAbsent is computing fingerprint could throw an exception. Code to account for exception looks pretty ugly.

          It also seems like some changes to the original patch, I dont remember using ConcurrentHashmap.

          The CHM was added in commit 37ae065 by Noble.

          I think that the ugliness of the code around exception handling here is a fine tradeoff for the correctness issue that we've already seen and the performance issue of locking on the whole map. If you have 5 fingerprints to calculate, then the current code will need 10 seconds total since each calculation can only happen serially. Letting computeIfAbsent manage the concurrent access for us seems much much better.

          Show
          mdrob Mike Drob added a comment - Thanks for looking. I agree that SOLR-9506 would be a good improvement too. The reason I didn't use computeIfAbsent is computing fingerprint could throw an exception. Code to account for exception looks pretty ugly. It also seems like some changes to the original patch, I dont remember using ConcurrentHashmap. The CHM was added in commit 37ae065 by Noble. I think that the ugliness of the code around exception handling here is a fine tradeoff for the correctness issue that we've already seen and the performance issue of locking on the whole map. If you have 5 fingerprints to calculate, then the current code will need 10 seconds total since each calculation can only happen serially. Letting computeIfAbsent manage the concurrent access for us seems much much better.
          Hide
          jira-bot ASF subversion and git services added a comment -

          Commit afc57347b47322290d6b0e6c00e4e3413ce2fbf0 in lucene-solr's branch refs/heads/master from Noble Paul
          [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=afc5734 ]

          SOLR-9524: SolrIndexSearcher.getIndexFingerprint uses dubious synchronization

          Show
          jira-bot ASF subversion and git services added a comment - Commit afc57347b47322290d6b0e6c00e4e3413ce2fbf0 in lucene-solr's branch refs/heads/master from Noble Paul [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=afc5734 ] SOLR-9524 : SolrIndexSearcher.getIndexFingerprint uses dubious synchronization
          Hide
          jira-bot ASF subversion and git services added a comment -

          Commit 7a1e6efa9678f9cdfb3f59f61fba6e60e725f3a7 in lucene-solr's branch refs/heads/branch_6x from Noble Paul
          [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=7a1e6ef ]

          SOLR-9524: SolrIndexSearcher.getIndexFingerprint uses dubious synchronization

          Show
          jira-bot ASF subversion and git services added a comment - Commit 7a1e6efa9678f9cdfb3f59f61fba6e60e725f3a7 in lucene-solr's branch refs/heads/branch_6x from Noble Paul [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=7a1e6ef ] SOLR-9524 : SolrIndexSearcher.getIndexFingerprint uses dubious synchronization
          Hide
          yseeley@gmail.com Yonik Seeley added a comment -

          I wonder if we could use the same type of logic for UnInvertedField.getUnInvertedField().... perhaps by adding an additional method to SolrCache that takes a creator?

          Show
          yseeley@gmail.com Yonik Seeley added a comment - I wonder if we could use the same type of logic for UnInvertedField.getUnInvertedField().... perhaps by adding an additional method to SolrCache that takes a creator?
          Hide
          shalinmangar Shalin Shekhar Mangar added a comment -

          Closing after 6.3.0 release.

          Show
          shalinmangar Shalin Shekhar Mangar added a comment - Closing after 6.3.0 release.

            People

            • Assignee:
              noble.paul Noble Paul
              Reporter:
              mdrob Mike Drob
            • Votes:
              0 Vote for this issue
              Watchers:
              9 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development