Uploaded image for project: 'Solr'
  1. Solr
  2. SOLR-9405

ConcurrentModificationException in ZkStateReader.getStateWatchers

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 6.1
    • Fix Version/s: 6.2, 7.0
    • Component/s: SolrCloud
    • Security Level: Public (Default Security Level. Issues are Public)
    • Labels:
      None

      Description

      Jenkins found this: http://jenkins.thetaphi.de/job/Lucene-Solr-6.x-Linux/1432/

      Stack Trace:
      java.util.ConcurrentModificationException
              at __randomizedtesting.SeedInfo.seed([FA459DF725097EFF:A77E52876204E1C1]:0)
              at java.util.HashMap$HashIterator.nextNode(java.base@9-ea/HashMap.java:1489)
              at java.util.HashMap$KeyIterator.next(java.base@9-ea/HashMap.java:1513)
              at java.util.AbstractCollection.addAll(java.base@9-ea/AbstractCollection.java:351)
              at java.util.HashSet.<init>(java.base@9-ea/HashSet.java:119)
              at org.apache.solr.common.cloud.ZkStateReader.getStateWatchers(ZkStateReader.java:1279)
              at org.apache.solr.common.cloud.TestCollectionStateWatchers.testSimpleCollectionWatch(TestCollectionStateWatchers.java:116)
      
      1. SOLR_9405_no_compute.patch
        2 kB
        David Smiley
      2. SOLR-9405.patch
        6 kB
        Shalin Shekhar Mangar
      3. SOLR-9405.patch
        6 kB
        Alan Woodward

        Activity

        Hide
        shalinmangar Shalin Shekhar Mangar added a comment -
        Show
        shalinmangar Shalin Shekhar Mangar added a comment - FYI Alan Woodward
        Hide
        romseygeek Alan Woodward added a comment -

        Patch fixing the issue, and also dealing with a race in the test.

        Show
        romseygeek Alan Woodward added a comment - Patch fixing the issue, and also dealing with a race in the test.
        Hide
        shalinmangar Shalin Shekhar Mangar added a comment -

        Thanks Alan. I am not sure how CollectionWatch#stateWatchers is thread-safe. Is it because all access to it is inside of ConcurrentHashMap#compute? The javadocs for compute only say that the entire method invocation is performed atomically. That doesn't necessarily mean that it is synchronized.

        What about access to the size() method? We access it independently in the canBeRemoved() method. Is that likely to cause a problem?

        Show
        shalinmangar Shalin Shekhar Mangar added a comment - Thanks Alan. I am not sure how CollectionWatch#stateWatchers is thread-safe. Is it because all access to it is inside of ConcurrentHashMap#compute? The javadocs for compute only say that the entire method invocation is performed atomically. That doesn't necessarily mean that it is synchronized. What about access to the size() method? We access it independently in the canBeRemoved() method. Is that likely to cause a problem?
        Hide
        romseygeek Alan Woodward added a comment -

        That doesn't necessarily mean that it is synchronized

        Hm, I disagree here - if something is performed atomically, it necessarily implies that it's synchronized, doesn't it? And the actual JVM implementation certainly uses synchronized blocks under the hood.

        canBeRemoved() is also only ever called from inside a compute() method call, so we should be thread-safe here as well.

        If we want to be totally safe, though, it should be possible to add a lock Object to CollectionWatch and synchronize on that inside the compute() blocks each time - I'll try it out.

        Show
        romseygeek Alan Woodward added a comment - That doesn't necessarily mean that it is synchronized Hm, I disagree here - if something is performed atomically, it necessarily implies that it's synchronized, doesn't it? And the actual JVM implementation certainly uses synchronized blocks under the hood. canBeRemoved() is also only ever called from inside a compute() method call, so we should be thread-safe here as well. If we want to be totally safe, though, it should be possible to add a lock Object to CollectionWatch and synchronize on that inside the compute() blocks each time - I'll try it out.
        Hide
        shalinmangar Shalin Shekhar Mangar added a comment -

        Hm, I disagree here - if something is performed atomically, it necessarily implies that it's synchronized, doesn't it?

        A method which calls compare and set is atomic but doesn't necessarily synchronize.

        And the actual JVM implementation certainly uses synchronized blocks under the hood.

        I'm not very familiar with the new collection methods such as compute and that is why I asked if this is really synchronized. All I am saying is that the Javadocs don't explicitly say that it is synchronized and therefore we should be vary of relying on such behaviour. It is possible that my understanding is wrong but I'd still like to understand why if you can help. If you are confident that the current code is correct, a code comment on stateWatchers stating why and how it must be used in future would be nice to have.

        Show
        shalinmangar Shalin Shekhar Mangar added a comment - Hm, I disagree here - if something is performed atomically, it necessarily implies that it's synchronized, doesn't it? A method which calls compare and set is atomic but doesn't necessarily synchronize. And the actual JVM implementation certainly uses synchronized blocks under the hood. I'm not very familiar with the new collection methods such as compute and that is why I asked if this is really synchronized. All I am saying is that the Javadocs don't explicitly say that it is synchronized and therefore we should be vary of relying on such behaviour. It is possible that my understanding is wrong but I'd still like to understand why if you can help. If you are confident that the current code is correct, a code comment on stateWatchers stating why and how it must be used in future would be nice to have.
        Hide
        eribeiro Edward Ribeiro added a comment -

        Hi guys, I guess this patch doesn't solve the issue at hand.

        TL;DR: the solution is to declare stateWatchers as stateWatchers = ConcurrentHashMap.newKeySet(); at L#150. That is, no need to modify getStateWatchers.

        Please, let me explain.

        First, the error is happening here:

          /* package-private for testing */
         1. Set<CollectionStateWatcher> getStateWatchers(String collection) {
         2.   CollectionWatch watch = collectionWatches.get(collection);
         3.   if (watch == null)
         4.     return null;
         5.   return new HashSet<>(watch.stateWatchers);
         6. }
        

        That is, ZkStateReader.getStateWatchers is creating a new HashSet instance by providing a new collection: CollectionWatch#stateWatchers. As we can see in ZkStateReader, watch#stateWatchers is also a HashSet.

        Okay, If we look into HashSet/AbstractCollection source code, we see that the constructor seen at line 5 (ABOVE) basically calls the addAll method passing the collection provided via the constructor. Then addAll basically loops on collection provided including the elements one by one in the new collection. See in the stack trace provided in this issue:

        at java.util.HashMap$KeyIterator.next(java.base@9-ea/HashMap.java:1513) // RUNNING THE FOR-EACH LOOP
        at java.util.AbstractCollection.addAll(java.base@9-ea/AbstractCollection.java:351)  // DELEGATES TO ADDALL
        at java.util.HashSet.<init>(java.base@9-ea/HashSet.java:119)  // THE CONSTRUCTOR
        

        In a nutshell, what is happening is that while we are populating the new HashSet instance at line 5 of ZkStateReader a new Thread changes stateWatchers concurrently. This throws the ConcurrentModificationException

        The proposed patch doesn't solve the issue because even if collectionWatches.compute is atomic, none of the Sets in use in ZkStateReader is thread-safe.

        I wrote a test program to demonstrate my speculation: https://gist.github.com/eribeiro/4141df2d02c62d7370101bc4349cd8c4

        Finally, sorry if I misunderstood the problem, and let me know if what I wrote made any sense.

        Cheers!

        Show
        eribeiro Edward Ribeiro added a comment - Hi guys, I guess this patch doesn't solve the issue at hand. TL;DR: the solution is to declare stateWatchers as stateWatchers = ConcurrentHashMap.newKeySet(); at L#150. That is, no need to modify getStateWatchers . Please, let me explain. First, the error is happening here: /* package - private for testing */ 1. Set<CollectionStateWatcher> getStateWatchers( String collection) { 2. CollectionWatch watch = collectionWatches.get(collection); 3. if (watch == null ) 4. return null ; 5. return new HashSet<>(watch.stateWatchers); 6. } That is, ZkStateReader.getStateWatchers is creating a new HashSet instance by providing a new collection: CollectionWatch#stateWatchers . As we can see in ZkStateReader, watch#stateWatchers is also a HashSet . Okay, If we look into HashSet / AbstractCollection source code, we see that the constructor seen at line 5 (ABOVE) basically calls the addAll method passing the collection provided via the constructor. Then addAll basically loops on collection provided including the elements one by one in the new collection. See in the stack trace provided in this issue: at java.util.HashMap$KeyIterator.next(java.base@9-ea/HashMap.java:1513) // RUNNING THE FOR-EACH LOOP at java.util.AbstractCollection.addAll(java.base@9-ea/AbstractCollection.java:351) // DELEGATES TO ADDALL at java.util.HashSet.<init>(java.base@9-ea/HashSet.java:119) // THE CONSTRUCTOR In a nutshell, what is happening is that while we are populating the new HashSet instance at line 5 of ZkStateReader a new Thread changes stateWatchers concurrently. This throws the ConcurrentModificationException The proposed patch doesn't solve the issue because even if collectionWatches.compute is atomic, none of the Sets in use in ZkStateReader is thread-safe. I wrote a test program to demonstrate my speculation: https://gist.github.com/eribeiro/4141df2d02c62d7370101bc4349cd8c4 Finally, sorry if I misunderstood the problem, and let me know if what I wrote made any sense. Cheers!
        Hide
        romseygeek Alan Woodward added a comment -

        Thanks Edward, that makes sense.

        I'm about to be away from a keyboard for two weeks - Shalin if you want to take this one in the meantime, feel free, otherwise I'll get to it when I'm back.

        Show
        romseygeek Alan Woodward added a comment - Thanks Edward, that makes sense. I'm about to be away from a keyboard for two weeks - Shalin if you want to take this one in the meantime, feel free, otherwise I'll get to it when I'm back.
        Hide
        shalinmangar Shalin Shekhar Mangar added a comment -

        Thanks Edward. Yes that's what I was going for. We need to make the stateWatchers thread-safe either by wrapping with synchronizedMap or by using a set backed by ConcurrentHashMap. I'll put up a patch.

        Have fun, Alan!

        Show
        shalinmangar Shalin Shekhar Mangar added a comment - Thanks Edward. Yes that's what I was going for. We need to make the stateWatchers thread-safe either by wrapping with synchronizedMap or by using a set backed by ConcurrentHashMap. I'll put up a patch. Have fun, Alan!
        Hide
        shalinmangar Shalin Shekhar Mangar added a comment -

        Patch which changes the stateWatcher to a set backed by ConcurrentHashMap. It also includes the race condition fix that Alan had supplied for the TestCollectionStateWatchers in his last patch. I'll commit shortly.

        Show
        shalinmangar Shalin Shekhar Mangar added a comment - Patch which changes the stateWatcher to a set backed by ConcurrentHashMap. It also includes the race condition fix that Alan had supplied for the TestCollectionStateWatchers in his last patch. I'll commit shortly.
        Hide
        jira-bot ASF subversion and git services added a comment -

        Commit f82c3b1206011776c55867fb2b5027b824f99812 in lucene-solr's branch refs/heads/master from Shalin Shekhar Mangar
        [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=f82c3b1 ]

        SOLR-9405: ConcurrentModificationException in ZkStateReader.getStateWatchers

        Show
        jira-bot ASF subversion and git services added a comment - Commit f82c3b1206011776c55867fb2b5027b824f99812 in lucene-solr's branch refs/heads/master from Shalin Shekhar Mangar [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=f82c3b1 ] SOLR-9405 : ConcurrentModificationException in ZkStateReader.getStateWatchers
        Hide
        jira-bot ASF subversion and git services added a comment -

        Commit 42254c2d9b0e93a96e280b44b2fcbd4b9b6693af in lucene-solr's branch refs/heads/branch_6x from Shalin Shekhar Mangar
        [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=42254c2 ]

        SOLR-9405: ConcurrentModificationException in ZkStateReader.getStateWatchers
        (cherry picked from commit f82c3b1)

        Show
        jira-bot ASF subversion and git services added a comment - Commit 42254c2d9b0e93a96e280b44b2fcbd4b9b6693af in lucene-solr's branch refs/heads/branch_6x from Shalin Shekhar Mangar [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=42254c2 ] SOLR-9405 : ConcurrentModificationException in ZkStateReader.getStateWatchers (cherry picked from commit f82c3b1)
        Hide
        shalinmangar Shalin Shekhar Mangar added a comment -

        Thanks Alan and Edward. I folded Alan's changes to use compute in getStateWatchers into the patch/commit.

        Show
        shalinmangar Shalin Shekhar Mangar added a comment - Thanks Alan and Edward. I folded Alan's changes to use compute in getStateWatchers into the patch/commit.
        Hide
        dsmiley David Smiley added a comment -

        Not sure if it's related to this issue, but when I beast test TestCollectionStateWatchers, I get "ZkTestServer Watch limit violations".
        ant beast -Dtestcase=TestCollectionStateWatchers -Dbeast.iters=100 It only took a few minutes.

        Separately from that, I find the use of ConcurrentHashMap.compute() in this class very odd... definitely not the intended purpose of this method – even the return value isn't used. I think it's a nifty method/feature, it's just that I don't see why ZkTestServer is calling it the way it's calling it versus simply calling get(). Attached is a patch showing something simpler. My beast test failed with and without this patch.

        Show
        dsmiley David Smiley added a comment - Not sure if it's related to this issue, but when I beast test TestCollectionStateWatchers, I get "ZkTestServer Watch limit violations". ant beast -Dtestcase=TestCollectionStateWatchers -Dbeast.iters=100 It only took a few minutes. Separately from that, I find the use of ConcurrentHashMap.compute() in this class very odd... definitely not the intended purpose of this method – even the return value isn't used. I think it's a nifty method/feature, it's just that I don't see why ZkTestServer is calling it the way it's calling it versus simply calling get() . Attached is a patch showing something simpler. My beast test failed with and without this patch.
        Hide
        mikemccand Michael McCandless added a comment -

        Bulk close resolved issues after 6.2.0 release.

        Show
        mikemccand Michael McCandless added a comment - Bulk close resolved issues after 6.2.0 release.

          People

          • Assignee:
            shalinmangar Shalin Shekhar Mangar
            Reporter:
            shalinmangar Shalin Shekhar Mangar
          • Votes:
            0 Vote for this issue
            Watchers:
            7 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development