SOLR-6146

Leak in CloudSolrServer causing "Too many open files"

    Details

      Description

      Due to a misconfiguration in one of our QA clusters, we uncovered a leak in CloudSolrServer. If this line throws:

      https://github.com/apache/lucene-solr/blob/branch_4x/solr/solrj/src/java/org/apache/solr/client/solrj/impl/CloudSolrServer.java#L242

      then the instantiated ZkStateReader is leaked.

      Here's the stacktrace of the Exception (we're using a custom build so the line numbers won't quite match up, but it gives the idea):
        at org.apache.solr.client.solrj.impl.CloudSolrServer.connect(CloudSolrServer.java:304)
        at org.apache.solr.client.solrj.impl.CloudSolrServer.requestWithRetryOnStaleState(CloudSolrServer.java:568)
        at org.apache.solr.client.solrj.impl.CloudSolrServer.request(CloudSolrServer.java:557)
        at org.apache.solr.client.solrj.request.AbstractUpdateRequest.process(AbstractUpdateRequest.java:117)
        at org.apache.solr.client.solrj.request.AbstractUpdateRequest.process(AbstractUpdateRequest.java:33)
        at com.apple.cie.search.client.crossdc.MirroredSolrRequestHandler.handleItem(MirroredSolrRequestHandler.java:100)
        at com.apple.cie.search.client.crossdc.MirroredSolrRequestHandler.handleItem(MirroredSolrRequestHandler.java:33)
        at com.apple.coda.queueing.CodaQueueConsumer$StreamProcessor.run(CodaQueueConsumer.java:147)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
        at java.lang.Thread.run(Thread.java:722)
      Caused by: org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = NoNode for /live_nodes
        at org.apache.zookeeper.KeeperException.create(KeeperException.java:111)
        at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
        at org.apache.zookeeper.ZooKeeper.getChildren(ZooKeeper.java:1468)
        at org.apache.solr.common.cloud.SolrZkClient$6.execute(SolrZkClient.java:256)
        at org.apache.solr.common.cloud.SolrZkClient$6.execute(SolrZkClient.java:253)
        at org.apache.solr.common.cloud.ZkCmdExecutor.retryOperation(ZkCmdExecutor.java:73)
        at org.apache.solr.common.cloud.SolrZkClient.getChildren(SolrZkClient.java:253)
        at org.apache.solr.common.cloud.ZkStateReader.createClusterStateWatchersAndUpdate(ZkStateReader.java:305)
        at org.apache.solr.client.solrj.impl.CloudSolrServer.createZkStateReader(CloudSolrServer.java:935)
        at org.apache.solr.client.solrj.impl.CloudSolrServer.connect(CloudSolrServer.java:298)
        ... 10 more
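
The shape of the leak described above can be sketched roughly as follows. This is a minimal illustration, not the SolrJ code: LeakSketch, FakeStateReader, createWatchers and openReadersAfter are hypothetical stand-ins, and the "connection" is just a counter. It assumes the reader acquires its connection when constructed and that the failing step runs before the reader is stored in a field.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical stand-ins only; none of these names come from SolrJ.
final class LeakSketch {
    // Number of stand-in readers currently holding their (pretend) connection open.
    final AtomicInteger open = new AtomicInteger();

    // Stand-in for ZkStateReader: "opens a connection" when constructed.
    final class FakeStateReader {
        FakeStateReader() { open.incrementAndGet(); }

        // Simulates the failing initialization step from the stack trace.
        void createWatchers() {
            throw new IllegalStateException("NoNode for /live_nodes (simulated)");
        }

        void close() { open.decrementAndGet(); }
    }

    FakeStateReader reader; // stays null when connect() fails

    // Leaky shape: when createWatchers() throws, the only reference is the local
    // variable, so nothing can ever call close() on the reader.
    void connect() {
        FakeStateReader zk = new FakeStateReader();
        zk.createWatchers(); // throws
        reader = zk;         // never reached
    }

    // Retries connect() the given number of times and reports how many readers are still open.
    int openReadersAfter(int attempts) {
        for (int i = 0; i < attempts; i++) {
            try {
                connect();
            } catch (IllegalStateException expected) {
                // a caller that retries simply tries again
            }
        }
        return open.get();
    }

    public static void main(String[] args) {
        System.out.println("readers still open: " + new LeakSketch().openReadersAfter(3));
    }
}
```

Every failed attempt leaves one more reader open. With a real ZooKeeper connection behind each reader, a client that keeps retrying would plausibly run out of file descriptors, which matches the "Too many open files" symptom in the title.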

      1. SOLR-6146.patch
        2 kB
        Varun Thacker
      2. SOLR-6146.patch
        1 kB
        Varun Thacker
      3. SOLR-6146.patch
        3 kB
        Shalin Shekhar Mangar
      4. SOLR-6146.patch
        3 kB
        Shalin Shekhar Mangar

        Activity

        Varun Thacker added a comment -

        Patch which closes the ZkStateReader in case of an exception.

        Shalin Shekhar Mangar added a comment -

        Thanks Varun. There's a failure in CloudSolrServerTest.testShutdown with this patch.

        Jessica Cheng Mallet - Can you tell us more about the misconfiguration that caused this error? It might help us simulate the failure in a test case.

        Varun Thacker added a comment -

        Thanks Shalin for reviewing.

        Here is a patch which doesn't break the test. The test explicitly checks for a TimeoutException thrown during object creation. Because the old patch used a catch-all Exception handler, the test's instanceof check failed.

        Jessica Cheng Mallet added a comment -

        Shalin Shekhar Mangar, we pointed the SolrJ instance at the wrong ZooKeeper chroot (one that no SolrCloud cluster is pointing at or writing to), and it seems the code blew up because there is no /live_nodes node. (See the stack trace in the description: "KeeperErrorCode = NoNode for /live_nodes".)

        Shalin Shekhar Mangar added a comment -

        In this patch, I made sure that the zkStateReader is closed on all exceptions, while also making sure that not every exception gets wrapped (to preserve back-compat). Varun shared a test which reproduces the problem, and it is included with this patch.

        I'll commit this shortly.
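
The approach described in this comment (close the reader on failure, rethrow without wrapping) might look roughly like the sketch below. It is not the committed patch: CloseOnFailureSketch, FakeStateReader and createWatchers are hypothetical stand-ins, and the failure used in main (a RuntimeException whose cause is a TimeoutException) is an assumed shape chosen to show why an unwrapped rethrow keeps an instanceof check on the cause working. Only unchecked exceptions are modelled, to keep the sketch short.

```java
import java.util.concurrent.TimeoutException;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical stand-ins only; this is not the committed patch.
final class CloseOnFailureSketch {
    // Number of stand-in readers currently open.
    final AtomicInteger open = new AtomicInteger();

    // Stand-in for ZkStateReader.
    final class FakeStateReader {
        private final RuntimeException failure; // null means initialization succeeds

        FakeStateReader(RuntimeException failure) {
            this.failure = failure;
            open.incrementAndGet();
        }

        void createWatchers() {
            if (failure != null) throw failure;
        }

        void close() { open.decrementAndGet(); }
    }

    FakeStateReader reader;

    void connect(RuntimeException simulatedFailure) {
        FakeStateReader zk = null;
        try {
            zk = new FakeStateReader(simulatedFailure);
            zk.createWatchers();
            reader = zk; // publish only after initialization succeeded
        } catch (RuntimeException e) {
            if (zk != null) zk.close(); // release the half-initialized reader
            throw e;                    // same exception object, not wrapped
        }
    }

    public static void main(String[] args) {
        CloseOnFailureSketch sketch = new CloseOnFailureSketch();
        RuntimeException failure = new RuntimeException(new TimeoutException("simulated"));
        try {
            sketch.connect(failure);
        } catch (RuntimeException e) {
            System.out.println("cause is TimeoutException: " + (e.getCause() instanceof TimeoutException));
        }
        System.out.println("readers still open: " + sketch.open.get());
    }
}
```

Wrapping the exception in the catch block would add one more level to the cause chain, so a caller checking e.getCause() would see the original exception instead of its cause. Rethrowing the same object avoids that.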

        Shalin Shekhar Mangar added a comment -

        Added a comment to the test to explain how/where it fails exactly.

        ASF subversion and git services added a comment -

        Commit 1601621 from shalin@apache.org in branch 'dev/trunk'
        [ https://svn.apache.org/r1601621 ]

        SOLR-6146: Incorrect configuration such as wrong chroot in zk server address can cause CloudSolrServer to leak resources

        ASF subversion and git services added a comment -

        Commit 1601622 from shalin@apache.org in branch 'dev/branches/branch_4x'
        [ https://svn.apache.org/r1601622 ]

        SOLR-6146: Incorrect configuration such as wrong chroot in zk server address can cause CloudSolrServer to leak resources

        Shalin Shekhar Mangar added a comment -

        Thanks Jessica and Varun!

        ASF subversion and git services added a comment -

        Commit 1601905 from shalin@apache.org in branch 'dev/trunk'
        [ https://svn.apache.org/r1601905 ]

        SOLR-6146: Close zk before setting interrupt status

        ASF subversion and git services added a comment -

        Commit 1601907 from shalin@apache.org in branch 'dev/branches/branch_4x'
        [ https://svn.apache.org/r1601907 ]

        SOLR-6146: Close zk before setting interrupt status
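
The commit message does not spell out why the order matters. One plausible reading, sketched below under that assumption, is that a close() containing a blocking step can be cut short if the thread's interrupt flag is already set, because blocking calls such as Thread.sleep throw InterruptedException immediately in that case. InterruptOrderSketch and BlockingCloseResource are hypothetical names, and the sleep stands in for whatever blocking work a real close might do.

```java
// Hypothetical stand-ins only; the blocking step in close() is an assumption.
final class InterruptOrderSketch {
    // Resource whose close() has a blocking step before cleanup completes.
    static final class BlockingCloseResource {
        boolean closed;

        void close() {
            try {
                Thread.sleep(1); // throws at once if the interrupt flag is already set
                closed = true;   // cleanup finished
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // cleanup was cut short
            }
        }
    }

    // Restores the interrupt flag first, then closes: the close is cut short.
    static boolean interruptThenClose() {
        BlockingCloseResource resource = new BlockingCloseResource();
        Thread.currentThread().interrupt();
        resource.close();
        Thread.interrupted(); // clear the flag so the demo does not affect the caller
        return resource.closed;
    }

    // Closes first, then restores the interrupt flag: the close completes.
    static boolean closeThenInterrupt() {
        BlockingCloseResource resource = new BlockingCloseResource();
        resource.close();
        Thread.currentThread().interrupt();
        Thread.interrupted(); // clear the flag so the demo does not affect the caller
        return resource.closed;
    }

    public static void main(String[] args) {
        System.out.println("interrupt, then close -> closed: " + interruptThenClose());
        System.out.println("close, then interrupt -> closed: " + closeThenInterrupt());
    }
}
```

In a real catch block the flag would be restored with Thread.currentThread().interrupt() and left set for the caller; the sketch clears it again only so the two demonstrations stay independent.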


          People

          • Assignee:
            Shalin Shekhar Mangar
            Reporter:
            Jessica Cheng Mallet