Uploaded image for project: 'Solr'
  1. Solr
  2. SOLR-11221

SolrJmxReporter broken on core reload

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Critical
    • Resolution: Fixed
    • Affects Version/s: 6.6, 6.7, 7.0, master (8.0)
    • Fix Version/s: 6.6.1, 7.0, 7.1, master (8.0)
    • Component/s: metrics
    • Security Level: Public (Default Security Level. Issues are Public)
    • Labels:
      None

      Description

      SolrJmxReporter uses Dropwizard's JmxReporter, which in turn uses a MetricRegistry listener to report newly added metrics as MBeans, and to unregister them from MBeanServer when the reporter is closed, which happens when core is closed.

      The metrics API keeps around existing metrics in solr.core.* registries to help maintain continuous metrics in presence of core reloads. However, this means that some of these metric instances are not registered anew when a core is reloaded - and for these metrics the listener won't fire, so the MBeans won't be registered.

      This limitation is a result of the use of MetricRegistryListener in JmxReporter and can't be fixed without reimplementing this class. Another possible approach would be to configure the JmxReporter to use a "mirroring" registry instead, which will be populated with existing metrics from the original registry (thus generating "metric added" events) and then kept in sync with the main registry via a listener.

      1. SOLR-11221.patch
        36 kB
        Andrzej Bialecki

        Activity

        Hide
        ab Andrzej Bialecki added a comment - - edited

        Hmm, yet another more serious bug is present: SolrJmxReporterTest.testReloadCore doesn't really test the exact sequence of reporters starting and closing, so it doesn't catch the following problem.

        On reload the old core is closed after the new core has been created - and consequently the old instance of SolrJmxReporter removes the mbeans right after they were registered by a new reporter started by the new core. End result is that only a handful of mbeans related to Searcher are registered, other mbeans are gone.

        In the past JmxMonitoredMap used to track what mbean belongs to what core by adding a "coreHashCode" attribute to all DynamicMBean-s, and removing only those with matching coreHashCode attribute. Unfortunately, it's not possible to add additional attributes to MBeans created by Dropwizard's JmxReporter, and on close() it indiscriminately closes all MBeans with known names.

        One possible solution would be to insert reporter's hashCode into the ObjectName hierarchy, but this would make the MBean names variable and unpredictable. Another (more complicated but more elegant) option is to reimplement JmxReporter to re-introduce "coreHashCode" attribute.

        Show
        ab Andrzej Bialecki added a comment - - edited Hmm, yet another more serious bug is present: SolrJmxReporterTest.testReloadCore doesn't really test the exact sequence of reporters starting and closing, so it doesn't catch the following problem. On reload the old core is closed after the new core has been created - and consequently the old instance of SolrJmxReporter removes the mbeans right after they were registered by a new reporter started by the new core. End result is that only a handful of mbeans related to Searcher are registered, other mbeans are gone. In the past JmxMonitoredMap used to track what mbean belongs to what core by adding a "coreHashCode" attribute to all DynamicMBean-s, and removing only those with matching coreHashCode attribute. Unfortunately, it's not possible to add additional attributes to MBeans created by Dropwizard's JmxReporter , and on close() it indiscriminately closes all MBeans with known names. One possible solution would be to insert reporter's hashCode into the ObjectName hierarchy, but this would make the MBean names variable and unpredictable. Another (more complicated but more elegant) option is to reimplement JmxReporter to re-introduce "coreHashCode" attribute.
        Hide
        ab Andrzej Bialecki added a comment -

        This patch changes SolrJmxReporter so that it uses a modified version of JmxReporter, which adds an "_instanceTag" attribute to track what beans it had registered and unregisters only those. In a sense this is the same mechanism that JmxMonitoredMap used, only it was called coreHashCode there (we can call it the same if it gives users a sense of familiarity, but I thought the "_instanceTag" name is harder to mistake for a real bean attribute).

        The new unit test passes with this change, while it was failing with the old implementation. Manual testing with JConsole also shows that metrics are now correctly reported in local and cloud mode, and after core and collection reloads.

        This is unfortunately a bigger change than I hoped for, so I'm not sure whether this should go into 7.0 at such late stage in the release. OTOH JMX monitoring is surely broken without this change, so it can't get much worse

        Show
        ab Andrzej Bialecki added a comment - This patch changes SolrJmxReporter so that it uses a modified version of JmxReporter , which adds an "_instanceTag" attribute to track what beans it had registered and unregisters only those. In a sense this is the same mechanism that JmxMonitoredMap used, only it was called coreHashCode there (we can call it the same if it gives users a sense of familiarity, but I thought the "_instanceTag" name is harder to mistake for a real bean attribute). The new unit test passes with this change, while it was failing with the old implementation. Manual testing with JConsole also shows that metrics are now correctly reported in local and cloud mode, and after core and collection reloads. This is unfortunately a bigger change than I hoped for, so I'm not sure whether this should go into 7.0 at such late stage in the release. OTOH JMX monitoring is surely broken without this change, so it can't get much worse
        Hide
        jira-bot ASF subversion and git services added a comment -

        Commit 7aa660b747440bfc5beb63051db324db3c5dd761 in lucene-solr's branch refs/heads/master from Andrzej Bialecki
        [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=7aa660b ]

        SOLR-11221: SolrJmxReporter broken on core reload.

        Show
        jira-bot ASF subversion and git services added a comment - Commit 7aa660b747440bfc5beb63051db324db3c5dd761 in lucene-solr's branch refs/heads/master from Andrzej Bialecki [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=7aa660b ] SOLR-11221 : SolrJmxReporter broken on core reload.
        Hide
        jira-bot ASF subversion and git services added a comment -

        Commit ff8b90b9acb7685a77d3b4af14f18aedb681176f in lucene-solr's branch refs/heads/master from Andrzej Bialecki
        [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=ff8b90b ]

        SOLR-11221: Fix precommit issues.

        Show
        jira-bot ASF subversion and git services added a comment - Commit ff8b90b9acb7685a77d3b4af14f18aedb681176f in lucene-solr's branch refs/heads/master from Andrzej Bialecki [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=ff8b90b ] SOLR-11221 : Fix precommit issues.
        Hide
        jira-bot ASF subversion and git services added a comment -

        Commit 2711dbed8b4b72c37fa929be5167fb6a823c1134 in lucene-solr's branch refs/heads/branch_7x from Andrzej Bialecki
        [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=2711dbe ]

        SOLR-11221: SolrJmxReporter broken on core reload.

        Show
        jira-bot ASF subversion and git services added a comment - Commit 2711dbed8b4b72c37fa929be5167fb6a823c1134 in lucene-solr's branch refs/heads/branch_7x from Andrzej Bialecki [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=2711dbe ] SOLR-11221 : SolrJmxReporter broken on core reload.
        Hide
        jira-bot ASF subversion and git services added a comment -

        Commit be596b90696b24c4e73e3e1455f1fb95dcc58b0d in lucene-solr's branch refs/heads/branch_7_0 from Andrzej Bialecki
        [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=be596b9 ]

        SOLR-11221: SolrJmxReporter broken on core reload.

        Show
        jira-bot ASF subversion and git services added a comment - Commit be596b90696b24c4e73e3e1455f1fb95dcc58b0d in lucene-solr's branch refs/heads/branch_7_0 from Andrzej Bialecki [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=be596b9 ] SOLR-11221 : SolrJmxReporter broken on core reload.
        Hide
        ab Andrzej Bialecki added a comment -

        I'm leaving this issue open for now - let's see if jenkins is happy about this change...

        Show
        ab Andrzej Bialecki added a comment - I'm leaving this issue open for now - let's see if jenkins is happy about this change...
        Hide
        jira-bot ASF subversion and git services added a comment -

        Commit 54469c7ca5f639a8120d9e4b9e51c0f82ab57b9b in lucene-solr's branch refs/heads/branch_6_6 from Andrzej Bialecki
        [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=54469c7 ]

        SOLR-11221: SolrJmxReporter broken on core reload.

        Show
        jira-bot ASF subversion and git services added a comment - Commit 54469c7ca5f639a8120d9e4b9e51c0f82ab57b9b in lucene-solr's branch refs/heads/branch_6_6 from Andrzej Bialecki [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=54469c7 ] SOLR-11221 : SolrJmxReporter broken on core reload.
        Hide
        shalinmangar Shalin Shekhar Mangar added a comment -

        Bulk close after 7.1.0 release

        Show
        shalinmangar Shalin Shekhar Mangar added a comment - Bulk close after 7.1.0 release

          People

          • Assignee:
            ab Andrzej Bialecki
            Reporter:
            ab Andrzej Bialecki
          • Votes:
            0 Vote for this issue
            Watchers:
            6 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development