Solr / SOLR-10420

Solr 6.x leaking one SolrZkClient instance per second

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 5.5.2, 6.4.2, 6.5
    • Fix Version/s: 5.5.5, 5.6, 6.5.1, 6.6, 7.0
    • Component/s: None
    • Security Level: Public (Default Security Level. Issues are Public)
    • Labels:
      None

      Description

      One of our nodes went berserk after a restart; Solr went completely nuts! So I opened VisualVM to keep an eye on it and spotted a different problem, one that occurs in all our Solr 6.4.2 and 6.5.0 nodes.

      It appears Solr is leaking one SolrZkClient instance per second via DistributedQueue$ChildWatcher. That one-per-second rate is quite accurate for all nodes: there are about as many instances as there are seconds since Solr started. I know VisualVM's instance count includes objects-to-be-collected, but the count does not drop after a forced garbage collection round.

      It doesn't matter how many cores or collections the nodes carry or how heavy traffic is.
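
      For illustration, this kind of symptom generally comes from a pattern like the sketch below. This is a hypothetical, simplified model with made-up names, not the actual Solr source: ZooKeeper retains every registered watcher until its event fires, and a watcher implemented as an inner class pins its enclosing object (and whatever that object references, such as the client) in memory.

      ```java
      // Hypothetical sketch of a one-watcher-per-poll leak; illustrative names,
      // not the real DistributedQueue code.
      import java.util.ArrayList;
      import java.util.List;

      class WatcherLeakSketch {
          // Stand-in for ZooKeeper's watch manager: watchers stay registered
          // until their event fires, so an event that never comes means a
          // watcher that is never released.
          static final List<Runnable> registeredWatchers = new ArrayList<>();

          private final Object zkClient = new Object(); // stands in for SolrZkClient

          void peek() {
              // A brand-new watcher on every call. As an anonymous inner class
              // it implicitly retains WatcherLeakSketch.this, and through it
              // the zkClient reference.
              registeredWatchers.add(new Runnable() {
                  @Override public void run() { /* onChildrenChanged */ }
              });
          }

          public static void main(String[] args) {
              WatcherLeakSketch queue = new WatcherLeakSketch();
              for (int i = 0; i < 5; i++) {
                  queue.peek(); // a poll loop calls this roughly once per second
              }
              // One retained watcher (and one retained client reference) per call:
              System.out.println(registeredWatchers.size()); // prints 5
          }
      }
      ```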

      Attachments

      1. SOLR-10420.patch
        3 kB
        Cao Manh Dat
      2. OverseerTest.80.stdout
        239 kB
        Steve Rowe
      3. SOLR-10420.patch
        5 kB
        Cao Manh Dat
      4. SOLR-10420.patch
        3 kB
        Cao Manh Dat
      5. OverseerTest.106.stdout
        223 kB
        Steve Rowe
      6. OverseerTest.119.stdout
        329 kB
        Steve Rowe
      7. OverseerTest.DEBUG.43.stdout
        820 kB
        Steve Rowe
      8. OverseerTest.DEBUG.48.stdout
        704 kB
        Steve Rowe
      9. OverseerTest.DEBUG.58.stdout
        759 kB
        Steve Rowe
      10. SOLR-10420.patch
        4 kB
        Cao Manh Dat
      11. SOLR-10420.patch
        5 kB
        Cao Manh Dat
      12. SOLR-10420-dragonsinth.patch
        7 kB
        Scott Blum

        Activity

        ichattopadhyaya Ishan Chattopadhyaya added a comment - edited

        Nothing changed in that code in the last few releases. Do you know if this worked fine in a prior 6x release?
        FYI, Scott Blum and Shalin Shekhar Mangar <-- experts in that code.

        markus17 Markus Jelsma added a comment -

        I only have 6.5.0 and a not-yet-upgraded 6.4.2; both suffer the same.

        But I just built 6.3.0 and ran it in cloud mode, without registering a collection or core, using the built-in ZooKeeper. After two minutes I had ~120 client objects, and now I have more.

        6.0.0 doesn't show increased instance counts. I can't test 6.1 and 6.2; ant keeps hanging on resolve for whatever reason.

        ichattopadhyaya Ishan Chattopadhyaya added a comment -

        The ant resolve could hang due to lock files. You could try this: find ~ -name "*lck" | xargs rm.

        markus17 Markus Jelsma added a comment -

        Ah, I found it: the problem appeared in 6.1.0. Versions 6.0.0 and 6.0.1 do not show this problem; there the instances are collected by GC.

        dragonsinth Scott Blum added a comment -

        Hard to see how the problem could be localized to DistributedQueue$ChildWatcher... it doesn't create any ZkClients; the client is passed in from the outside.

        markus17 Markus Jelsma added a comment -

        Well, DistributedQueue$ChildWatcher is being leaked as well, so the leaking of SolrZkClient could be a consequence of that.

        markus17 Markus Jelsma added a comment -

        To note another oddity: some nodes of our regular search cluster (6.5.0) do not show increased counts, while some nodes with other roles (but running Solr) showed the problem immediately after every restart I did today. So it could be that 6.0.1 and 6.0.0 also show the problem, although they didn't when I just tested them.

        wunder Walter Underwood added a comment -

        To be clear, these are uncollectable objects?

        markus17 Markus Jelsma added a comment - edited

        So it seems. Forced GC does not remove the object instances in >= 6.1.0. In 6.0.x, both regular GC and forced GC do remove the instances from the object count. I think almost everyone should be able to see it for themselves: almost all our Solr instances show this problem immediately after restart, though some don't on some occasions.

        Although the instances don't consume a lot of bytes, the problem appears to cause increased CPU usage.

        Filtering the memory sampler for org.apache.solr.common reveals it right away. Also, the number of instances corresponds to the number of seconds the node has been running: a node up for about six days has roughly 500k instances, one up for roughly 30 minutes has just below 2k.

        caomanhdat Cao Manh Dat added a comment -

        Patch for this ticket. This problem was introduced by SOLR-9191 and is a serious problem for Solr 6.x.

        markus17 Markus Jelsma added a comment -

        This patch appears to solve the problem.

        steve_rowe Steve Rowe added a comment -

        I ran all Solr tests with the patch on master, and one test failed:

           [junit4]   2> 264992 ERROR (OverseerExitThread) [    ] o.a.s.c.Overseer could not read the data
           [junit4]   2> org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = Session expired for /overseer_elect/leader
           [junit4]   2>        at org.apache.zookeeper.KeeperException.create(KeeperException.java:127)
           [junit4]   2>        at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
           [junit4]   2>        at org.apache.zookeeper.ZooKeeper.getData(ZooKeeper.java:1155)
           [junit4]   2>        at org.apache.solr.common.cloud.SolrZkClient$7.execute(SolrZkClient.java:356)
           [junit4]   2>        at org.apache.solr.common.cloud.SolrZkClient$7.execute(SolrZkClient.java:353)
           [junit4]   2>        at org.apache.solr.common.cloud.ZkCmdExecutor.retryOperation(ZkCmdExecutor.java:60)
           [junit4]   2>        at org.apache.solr.common.cloud.SolrZkClient.getData(SolrZkClient.java:353)
           [junit4]   2>        at org.apache.solr.cloud.Overseer$ClusterStateUpdater.checkIfIamStillLeader(Overseer.java:290)
           [junit4]   2>        at java.lang.Thread.run(Thread.java:745)
           [junit4]   2> NOTE: reproduce with: ant test  -Dtestcase=OverseerTest -Dtests.method=testExternalClusterStateChangeBehavior -Dtests.seed=2110CE0AEF674CFA -Dtests.slow=true -Dtests.locale=es-GT -Dtests.timezone=Asia/Kolkata -Dtests.asserts=true -Dtests.file.encoding=UTF-8
           [junit4] FAILURE 5.46s J12 | OverseerTest.testExternalClusterStateChangeBehavior <<<
           [junit4]    > Throwable #1: java.lang.AssertionError: Illegal state, was: down expected:active clusterState:live nodes:[]collections:{c1=DocCollection(c1//clusterstate.json/2)={
           [junit4]    >   "shards":{"shard1":{
           [junit4]    >       "parent":null,
           [junit4]    >       "range":null,
           [junit4]    >       "state":"active",
           [junit4]    >       "replicas":{"core_node1":{
           [junit4]    >           "base_url":"http://127.0.0.1/solr",
           [junit4]    >           "node_name":"node1",
           [junit4]    >           "core":"core1",
           [junit4]    >           "roles":"",
           [junit4]    >           "state":"down"}}}},
           [junit4]    >   "router":{"name":"implicit"}}, test=LazyCollectionRef(test)}
           [junit4]    >        at __randomizedtesting.SeedInfo.seed([2110CE0AEF674CFA:490ECDE60DF716B4]:0)
           [junit4]    >        at org.apache.solr.cloud.AbstractDistribZkTestBase.verifyReplicaStatus(AbstractDistribZkTestBase.java:273)
           [junit4]    >        at org.apache.solr.cloud.OverseerTest.testExternalClusterStateChangeBehavior(OverseerTest.java:1259)
        

        I ran the repro line a couple of times and it didn't reproduce. I then beasted 100 iterations of the test suite using Miller's beasting script, and it failed once. I'm attaching the test log from the failure.

        Looking at emailed Jenkins reports of testExternalClusterStateChangeBehavior() failing, I see that it was failing almost daily until the day SOLR-9191 was committed (June 9, 2016), and there have been zero failures since. So this failure seems suspicious to me, given that this issue is related to SOLR-9191.

        I beasted 200 iterations of OverseerTest without the patch, and got zero failures.

        caomanhdat Cao Manh Dat added a comment - edited

        Steve Rowe I think this is a problem with the test. Can you run the test with this patch? (I increased the amount of time to wait for the replica to become active.) I also ran testExternalClusterStateChangeBehavior 1000 times and could not find any failure.

        caomanhdat Cao Manh Dat added a comment - edited

        A patch for this ticket. In this patch we reuse the ChildWatcher (this is Noble's idea), so in any case (even under race conditions) we always reach the line

        lastWatcher = null
        
        ichattopadhyaya Ishan Chattopadhyaya added a comment - edited

        Ran the test 250 times each with the original patch as well as the 15s-timeout patch. I saw no failures. I can run this on better hardware early next week (my AMD Ryzen is arriving soon!).

        steve_rowe Steve Rowe added a comment -

        I beasted the latest patch (the one without the increased OverseerTest timeouts) for 200 iterations and got 2 failures - I've attached their logs: OverseerTest.106.stdout and OverseerTest.119.stdout.

        Next I'll try the patch with the OverseerTest changes.

        steve_rowe Steve Rowe added a comment - edited

        Next I'll try the patch with the OverseerTest changes.

        I got 4 failures from 100 beasting iterations using this version of the patch. The failures looked the same as the previously posted logs, so I won't add them.

        Next I'll try beasting the latest patch again, this time with DEBUG level on org.apache.solr.cloud.

        steve_rowe Steve Rowe added a comment -

        I got 5 failures out of 100, attaching 3 of them here: OverseerTest.DEBUG.43.stdout, OverseerTest.DEBUG.48.stdout, OverseerTest.DEBUG.58.stdout

        dragonsinth Scott Blum added a comment -

        Fix LGTM.

        Is the actual fix this?

        ```
        // we're not in a dirty state, and we do not have in-memory children
        if (lastWatcher != null) return null;
        ```

        I.e., if you just do that, would that fix the leak even without reusing the same object?

        caomanhdat Cao Manh Dat added a comment -

        It actually fixes the problem even without reusing the same object, but it makes the OverseerTest fail.

        caomanhdat Cao Manh Dat added a comment - edited

        I think I've found the reason for the test failure. A few things to note:

        • In the current DQ version, every time we peek() and the in-memory queue is empty, we actually look at ZK to get new elements (the watcher is useless in this scenario).
        • With the patch, every time we peek() and the in-memory queue is empty, we only look at the ZK nodes when the watcher tells us something changed in our queue. So even if ZK already contains new items, we may still return empty.

        So this is why the test fails:

        • overseer.queue <- set a replica down
        • overseer runs the command successfully
        • overseer.queue <- set a replica active
        • overseer delays this command (overseer.workqueue <- set a replica active)
        • touch /clusterstate.json to change its version
        • overseer.queue <- some ZkWriteCommand; let's call this one ZK1
        • overseer changes the clusterstate to set the replica active
        • overseer hits a BadVersion exception
        • overseer fetches the last element from overseer.workqueue. Here is where the problem happens: overseer.workqueue.peek() returns empty because the watcher has not fired.
        • overseer processes ZK1; it succeeds -> overseer.workqueue is emptied.
        caomanhdat Cao Manh Dat added a comment -

        I'm thinking about two different ways to solve this problem (see the sketch after this list):
        1. DQ sets lastWatcher = null whenever DQ.offer(byte[] data) succeeds. Because the overseer.workqueue is offered locally by the overseer, that will fix the problem. But it changes DQ.peek() from a true positive (if ZK contains a new item, we will return it) to a false positive (if ZK contains a new item, we may not return it), so it may affect other parts as well.
        2. Each time DQ.peek() is called, we look at the ZK nodes without using the watcher.
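
        A minimal sketch of what solution #1 could look like, under simplified, hypothetical names (the in-memory child list and the actual ZK calls are elided; the real DistributedQueue differs in detail):

        ```java
        // Hypothetical sketch of solution #1; not the actual Solr patch.
        class DistributedQueueSketch {
            private Object lastWatcher; // non-null while a child watch is outstanding

            synchronized void offer(byte[] data) {
                // ... zk.create(...) would enqueue the element here ...
                // Clearing the watcher marks our cached view of ZK as stale,
                // so the next peek() on this node re-reads ZK instead of
                // waiting for a watch event. This preserves the same-thread
                // offer() -> peek() behavior the Overseer relies on.
                lastWatcher = null;
            }

            synchronized byte[] peek() {
                if (lastWatcher != null) {
                    // A watch is already outstanding and nothing has changed
                    // since the last read; returning here without registering
                    // a second watcher is the guard that stops the leak.
                    return null;
                }
                lastWatcher = new Object(); // register exactly one watcher with getChildren()
                // ... fetch children from ZK and return the head element, if any ...
                return null;
            }
        }
        ```

        The trade-off described in item #1 is visible here: a peek() while a watch is outstanding trusts the cached (possibly stale) view rather than re-reading ZK.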

        caomanhdat Cao Manh Dat added a comment -

        I had a discussion with Noble. It seems that DQ is not used anywhere except the Overseer, so I will go with solution #1.

        Will beast the test on Steve's machine tonight (thanks a lot, Steve Rowe).

        caomanhdat Cao Manh Dat added a comment -

        Scott Blum Can you review the patch?

        dragonsinth Scott Blum added a comment -

        Let me try to unpack what you said...

        1) We want a synchronous offer() -> peek() on the same thread to return the item offered without delay.
        2) This works on master, but the original patch to fix the leak breaks #1.

        Is that correct?

        Let me look at this on Monday with Joshua Humphries. I'm pretty sure there's a simplification to be made in DQ with how we're handling the watcher and dirty tracking. There used to be an explicit "isDirty" bit that we traded out for watcher nullability, which in retrospect I'm not sure was the best choice.

        caomanhdat Cao Manh Dat added a comment -

        Scott Blum that's correct.

        The latest patch passes all the tests.

        caomanhdat Cao Manh Dat added a comment -

        Scott Blum As Steve said, the Overseer.external.. test failed frequently before SOLR-9191 was committed, so I doubt that bringing back the old "isDirty" code would fix the problem.

        caomanhdat Cao Manh Dat added a comment -

        Latest patch: I would prefer not to reuse lastWatcher. It can come to this case, where the stale event from the watcher registered at (1) clears the watcher registered at (2):

        peek -> lastWatcher = reuseWatcher (1)
        offer -> lastWatcher = null
        peek -> lastWatcher = reuseWatcher (2)
        (1) event -> lastWatcher = null
        
        dragonsinth Scott Blum added a comment -

        Cao Manh Dat I didn't literally mean that we should bring back the isDirty bit. I meant that clearly, the last time around, there was a hole in the design that led to this leak. I want to take the opportunity to look at the design again as a whole and make sure everything seems good, and that we're not just putting a band-aid on it. You may have already done this, so just give me a little bit to catch up.

        dragonsinth Scott Blum added a comment -

        Cao Manh Dat Joshua Humphries I think this may be the right approach after reviewing the overall design. I don't see any real reason to specifically track lastWatcher; we just need to ensure that no more than one is ever set. And having lastWatcher serve double duty was a misdesign on my part. There are really two separate stateful questions to answer:

        1) Is there a watcher set?
        2) Are we known to be dirty?

        The answers to those two questions are not the same if we want same-thread synchronous offer -> poll to work as you would expect, so this patch tracks them separately (see the sketch below).
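
        A rough sketch of that separation, using hypothetical names (the attached SOLR-10420-dragonsinth.patch is the authoritative version):

        ```java
        // Hypothetical sketch of tracking "watcher set?" and "known dirty?"
        // as separate bits; not the actual committed patch.
        class DistributedQueueSketch {
            private boolean watcherSet;     // question 1: is a watcher currently registered?
            private boolean isDirty = true; // question 2: might ZK differ from our cached view?

            // Invoked when the single outstanding watch fires.
            synchronized void onChildrenChanged() {
                watcherSet = false; // the watcher has been consumed by ZooKeeper
                isDirty = true;     // ZK changed, so the cached view is stale
                notifyAll();
            }

            // Invoked after a local offer(): the cached view is stale, but any
            // registered watcher stays in place, so at most one ever exists.
            synchronized void markDirty() {
                isDirty = true;
            }

            synchronized void refreshIfNeeded() {
                if (!isDirty) {
                    return; // cached view is still trustworthy; nothing to do
                }
                if (!watcherSet) {
                    watcherSet = true; // register exactly one watcher with getChildren()
                }
                // ... re-read children from ZK ...
                isDirty = false;
            }
        }
        ```

        After a local offer(), isDirty is true while watcherSet may still be true; the two answers differ, which is exactly the case a single lastWatcher field cannot express.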

        caomanhdat Cao Manh Dat added a comment - edited

        Scott Blum The patch LGTM!
        I think it is fine to reuse the watcher in your patch.

        caomanhdat Cao Manh Dat added a comment -

        Scott Blum This issue is a blocker for Solr 6.5.1, so I think we (you) should commit soon.

        dragonsinth Scott Blum added a comment - edited

        Agreed. It passes for me. Anyone on this issue want to do any extensive testing of https://issues.apache.org/jira/secure/attachment/12863715/SOLR-10420-dragonsinth.patch before I commit? Otherwise I'll commit this today to master and then start backporting it to a number of branches.

        steve_rowe Steve Rowe added a comment -

        I did 500 beasting iterations of OverseerTest using your latest patch, Scott Blum: zero failures.

        I'm running the full Solr core test suite a couple times now, I'll report back when it finishes (should be less than half an hour).

        steve_rowe Steve Rowe added a comment -

        I'm running the full Solr core test suite a couple times now, I'll report back when it finishes (should be less than half an hour).

        I ran the solr-core and solrj suites three times each with Scott Blum's latest patch on master, and there were zero failures.

        dragonsinth Scott Blum added a comment -

        Great! Thanks Steve Rowe! I'll get this committed.

        jira-bot ASF subversion and git services added a comment -

        Commit 43c2b2320dcf344c42086ceb782e0fc53c439952 in lucene-solr's branch refs/heads/master from Scott Blum
        [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=43c2b23 ]

        SOLR-10420: fix watcher leak in DistributedQueue

        dragonsinth Scott Blum added a comment -

        So my plan is to commit this to master, branch_6x, and branch_5x, and let the release managers pull it into the actual release branches. SG?

        steve_rowe Steve Rowe added a comment - edited

        So my plan is to commit this to master, branch_6x, and branch_5x, and let the release managers pull it into the actual release branches. SG?

        I think at a minimum you should also commit to branch_6_5, since it's definitely going to be released, and you'll be better equipped to handle conflicts if there are any.

        I noticed you set fixVersion to releases on branches you won't be committing to: 6.4.3, 5.5.5. I don't think you should do that. If you're going to commit to branch_5x, then the fixVersion would be 5.6, since that's the next release on that branch. Unless you commit to branch_6_4, you shouldn't include 6.4.3 as a fixVersion.

        More generally, if you think it's essential that any X.Y.Z release includes this fix, i.e. that it would be a mistake to release without it, then you should commit to the branch from which that release will be made. Otherwise you and others may/will forget to backport when such a release materializes.

        ichattopadhyaya Ishan Chattopadhyaya added a comment -

        I think you should do master, branch_6x and branch_6_5. branch_5x is optional.

        dragonsinth Scott Blum added a comment -

        Got it. I do think it would be a mistake. In that case, after I've committed to 5x and 6x, I'll also commit to 6_5 and 5_5.

        jira-bot ASF subversion and git services added a comment -

        Commit ae55dfc10fd3843d35df9096b7626aad36735670 in lucene-solr's branch refs/heads/branch_6x from Scott Blum
        [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=ae55dfc ]

        SOLR-10420: fix watcher leak in DistributedQueue

        jira-bot ASF subversion and git services added a comment -

        Commit 42d08dd28c6609a2c70a691e6a88725c9aa31377 in lucene-solr's branch refs/heads/branch_5x from Scott Blum
        [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=42d08dd ]

        SOLR-10420: fix watcher leak in DistributedQueue

        jira-bot ASF subversion and git services added a comment -

        Commit e3beb61a72efbce37710ce3cc48b24093070d052 in lucene-solr's branch refs/heads/branch_6_5 from Scott Blum
        [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=e3beb61 ]

        SOLR-10420: fix watcher leak in DistributedQueue

        jira-bot ASF subversion and git services added a comment -

        Commit 89beee8d61346d50dbbf02f0cc9cfc5032e46eee in lucene-solr's branch refs/heads/branch_5_5 from Scott Blum
        [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=89beee8 ]

        SOLR-10420: fix watcher leak in DistributedQueue

        markus17 Markus Jelsma added a comment -

        Thanks! Great work!


  People

  • Assignee: dragonsinth Scott Blum
  • Reporter: markus17 Markus Jelsma
  • Votes: 1
  • Watchers: 15