Description
The life cycle of Metrics producers are managed by the core (mostly). So, if the lifecycle of the object is different from that of the core itself, these objects will never be unregistered from the metrics registry. This will lead to memory leaks
Attachments
Attachments
- SOLR-13677.patch
- 124 kB
- Andrzej Bialecki
Issue Links
- causes
-
SOLR-13866 DUH2 stats not populated in PluginInfoHandler
- Resolved
- links to
Activity
Noble, is there a reason why you push your feature branch to Apache GIT instead of keeping it in your own fork and do a PR when it’s ready? Do you expect there to be collaboration? Reason I ask is that it adds some “noise” to the lists for every push, merge etc.
What we need is avoid notifications for commits to all the JIRA branches . Using a feature branch is good for collaboration. Every committer automatically had access to your branch
The current patch really complicates the API and the expectations on the implementors. I think we need a different strategy.
How about the following:
- in SolrMetricProducer instead of using AutoCloseable.close() simply add a different default method eg. unregisterMetrics(). Chances are that many of existing components already implement close() so the implementors will not be aware that they need to also unregister the component's metrics.
- I don't think we should expose GaugeWrapper outside of SolrMetricManager, this is really an internal detail. Adding the method above should be sufficient to explicitly unregister the metrics - and all SolrInfoBean-s already know what metric names (and what tag) they registered, because the tag is passed in initializeMetrics and the names are recorded in registerMetricName.
In SolrMetricProducer instead of using AutoCloseable.close() simply add a different default method eg. unregisterMetrics().
I thought of doing it. But the problem is that there is a chance that unregisterMetrics() is not invoked at all, as it is not the responsibility of the metrics framework. AutoCloseable.close() is called anyway. All the cleanup should be performed there. Even if we have an unregisterMetrics() method, we should automatically call it from the AutoCloseable.close() method.
I don't think we should expose GaugeWrapper outside of SolrMetricManager, this is really an internal detail.
I have changed the branch to not expose the GaugeWrapper
Commit f983ede5fc7d14cb004d1e8674b709ffefd8b06e in lucene-solr's branch refs/heads/jira/SOLR-13677 from Noble Paul
[ https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=f983ede ]
SOLR-13677: Created an interface called GaugeRef instead of exposing the internal class GaugeWrapper
- public void registerGauge(SolrInfoBean info, String registry, Gauge<?> gauge, String tag, boolean force, String metricName, String... metricPath) { + public GaugeRef registerGauge(SolrInfoBean info, String registry, Gauge<?> gauge, String tag, boolean force, String metricName, String... metricPath) {
The above registerGauge method change to make it return a gauge reference encourages but does not ensure that the caller 'remembers' the reference and so that it is later then included in the unregister calls. There's also a small amount of repetition w.r.t. iterating over the myGauges collection and unregistering the elements.
I wonder if some sort of container or wrapper class might be helpful i.e. SolrMetricManager.registerGauge would be sure to call 'remember' for the gauge and the close(?) method of the producer would call the 'forgetAll' method. What do you think?
+ class FooBar { + private final List<GaugeRef> gaugeRefs = new ArrayList<>(); + void remember(GaugeRef gaugeRef) { + gaugeRefs.add(gaugeRef); + } + void forgetAll() { + for (GaugeRef gaugeRef : gaugeRefs) { + gaugeRef.unregister(); + } + gaugeRefs.clear(); + } + } + + public void registerGauge(FooBar memory, SolrInfoBean info, String registry, Gauge<?> gauge, String tag, boolean force, String metricName, String... metricPath) { + memory.remember(registerGauge(info, registry, gauge, tag, force, metricName, metricPath)); + } + + private GaugeRef registerGauge(SolrInfoBean info, String registry, Gauge<?> gauge, String tag, boolean force, String metricName, String... metricPath) { - public void registerGauge(SolrInfoBean info, String registry, Gauge<?> gauge, String tag, boolean force, String metricName, String... metricPath) { ...
Thanks Christine, this looks somewhat cleaner.
For historical reasons (a complicated refactoring of SolrInfoMBean, UI dependencies, JMX, etc) some of the methods one would expect from SolrMetricProducer ended up in SolrInfoBean instead, and this is also the interface that is passed to registerGauge. Unless we want to do a larger refactoring now, we could treat SolrInfoBean as the memory, and add the default methods for remembering and forgetting the gauges to this interface.
Also, GaugeRef can be an interface that GaugeWrapper implements, then there's no need to create even more objects.
cpoerschke, the problem is not with adding those remember() forget() methods. There is no way to ensure that those methods are invoked . We may be able to check them today. But who will ensure that about the future components. The advantage of piggybacking on close() is that it is a well known and usually close is called always
usually close is called always
There's no hard guarantee in any of these scenarios, just a degree of likelihood.
I think that the patch over-complicates things by keeping an explicit reference to the gauges. It should not be needed if we properly use the tag argument when registering gauges.
The tag attribute was added to solve a problem of non-deterministic order of gauge registration during core reload - the new core would already register some of the new gauges but the old core would linger for a while and when it tried to unregister gauges with the same names it would unregister the new ones instead of the old ones.
So the important thing about the tag argument in SolrMetricManager.registerGauge and SolrMetricManager.unregisterGauges is this: the tag represents an object or a group of objects with the same life-cycle. Until now the tag was generated by SolrCore and passed to all its components because they had the same lifecycle. Also, it was SolrCore (via SolrCoreMetricManager) that would call unregisterGauges on behalf of all its components.
Now, if the life-cycle of components is different from that of SolrCore then we need to make sure of two things:
- the tag properly represents the object or group of objects with the same life-cycle - so if eg. SolrCache-s can be re-loaded without reloading SolrCore then they should no longer use the same tag as their parent SolrCore. Instead they should generate their own tags.
- each component must be now responsible for unregistering its own gauges, as identified by its own tag. We can strongly encourage implementors to do this in each component's AutoCloseable.close() but I don't see any easy way to actually enforce it.
This approach doesn't require keeping actual references to gauges in each component.
For convenience we could also extend the concept of tag so that it's multi-valued - eg. a cache would use its own tag and the parent SolrCore-s tag. This way both the cache and the SolrCore could each easily unregister gauges that they (or their parent) created.
the tag properly represents the object or group of objects with the same life-cycle -...
I think this is a better solution. I was not sure about the implications of such a change. I will try to implement it and you can review it
Commit 9b0003a7037206d937b9f4aa48e5dc4cf80fdd0f in lucene-solr's branch refs/heads/jira/SOLR-13677_1 from Noble Paul
[ https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=9b0003a ]
SOLR-13677: Take 2
Commit 956f61dde9b1ce3fee6b609955d41d0b90c67285 in lucene-solr's branch refs/heads/jira/SOLR-13677_1 from Noble Paul
[ https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=956f61d ]
SOLR-13677: minimize changes
Commit 2a7d2df5ff52bb09daf3223c4144aaf6964acdfe in lucene-solr's branch refs/heads/jira/SOLR-13677_2 from Noble Paul
[ https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=2a7d2df ]
SOLR-13677: new fresh branch
Commit 7415fe453972fb23debd4f570599b5eeb0376ecb in lucene-solr's branch refs/heads/master from Noble Paul
[ https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=7415fe4 ]
SOLR-13677: All Metrics Gauges should be unregistered by the objects that registered them (#836)
SOLR-13677: All Metrics Gauges should be unregistered by the objects that registered them
Commit 7415fe453972fb23debd4f570599b5eeb0376ecb in lucene-solr's branch refs/heads/master from Noble Paul
[ https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=7415fe4 ]
SOLR-13677: All Metrics Gauges should be unregistered by the objects that registered them (#836)
SOLR-13677: All Metrics Gauges should be unregistered by the objects that registered them
Commit 6c94f659d25128614f7241c0c0e4f310e6737e36 in lucene-solr's branch refs/heads/master from Noble Paul
[ https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=6c94f65 ]
SOLR-13677: All Metrics Gauges should be unregistered by the objects that registered them
Commit 1c0490c9ab472327f1267ed46f5ba844de39ae77 in lucene-solr's branch refs/heads/branch_8x from Noble Paul
[ https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=1c0490c ]
SOLR-13677: All Metrics Gauges should be unregistered by the objects that registered them (#836)
SOLR-13677: All Metrics Gauges should be unregistered by the objects that registered them
Commit 1c0490c9ab472327f1267ed46f5ba844de39ae77 in lucene-solr's branch refs/heads/branch_8x from Noble Paul
[ https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=1c0490c ]
SOLR-13677: All Metrics Gauges should be unregistered by the objects that registered them (#836)
SOLR-13677: All Metrics Gauges should be unregistered by the objects that registered them
Commit bad55fd51a0b83d2ef632e72a2d0b5e0cb30f88e in lucene-solr's branch refs/heads/branch_8x from Noble Paul
[ https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=bad55fd ]
SOLR-13677: All Metrics Gauges should be unregistered by the objects that registered them
Commit 7415fe453972fb23debd4f570599b5eeb0376ecb in lucene-solr's branch refs/heads/jira/SOLR-13650 from Noble Paul
[ https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=7415fe4 ]
SOLR-13677: All Metrics Gauges should be unregistered by the objects that registered them (#836)
SOLR-13677: All Metrics Gauges should be unregistered by the objects that registered them
Commit 7415fe453972fb23debd4f570599b5eeb0376ecb in lucene-solr's branch refs/heads/jira/SOLR-13650 from Noble Paul
[ https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=7415fe4 ]
SOLR-13677: All Metrics Gauges should be unregistered by the objects that registered them (#836)
SOLR-13677: All Metrics Gauges should be unregistered by the objects that registered them
Commit 6c94f659d25128614f7241c0c0e4f310e6737e36 in lucene-solr's branch refs/heads/jira/SOLR-13650 from Noble Paul
[ https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=6c94f65 ]
SOLR-13677: All Metrics Gauges should be unregistered by the objects that registered them
There were several issues that we pointed out in your patch before the last refactoring - and yet after the last refactoring you didn't give anyone a chance to review your changes before committing.
I object to this on principle - knowing that there are issues to be resolved you should've waited a reasonable time for review, and 5 hours is not reasonable, especially during summer holidays. On this grounds alone I'd like to request that you revert these commits.
However, in addition to that there are still several issues remaining in the committed changes that need to be fixed - and it would be just easier and cleaner to do this by reverting, fixing and committing them again after review:
- SolrMetrics is a misleading name - it implies more functionality than it really provides, which is just a dumb context to keep several references together. Previous patch used MetricsInfo which was not ideal but better that this. I propose SolrMetricsContext.
- Also, the class and its methods have 0 javadocs - I hope that one day this will actually become a precommit error.
- looking at usages of this class it seems to me that perhaps it should not include scope - take for example SolrCoreMetricManager.registerMetricProducer (which hasn't been converted yet to the new API!), or any other use of the old SolrMetricProducer.initializeMetrics: it would be so much easier to create just one SolrMetrics object and pass it to each producer, regardless of its scope.
- I don't understand the need to keep a reference SolrMetrics.parent. This is used only in one place in MetricsHandlerTest.RefreshablePluginHolder.closeHandler(), which looks like a hack and has no way of working anyway (comparing hex hashCode to plain int hashCode).
- GaugeWrapper.unregisterGauges: expression wrapper.getTag().contains(tagSegment) already includes tagSegment.equals(wrapper.getTag()).
- I don't understand the conditional in SolrMetricProducer.close() - basically it means that you unregister on close only non-root components and all their children. Why exclude root components?
- SolrMetrics.gauge really doesn't need to mess with explicitly creating a GaugeWrapper, and the new (undocumented) method in SolrMetricManager that you added is dangerous - because it exposes the ability to easily register unwrapped gauges. Instead SolrMetric.gauge should simply delegate to SolrMetricManager.registerGauge. Also, providing a null metricName is an error and should throw an exception - is there any situation where you would want that?
- I don't see any need for the newly added methods in SolrMetricManager - existing methods were sufficient for implementing the functionality in SolrMetrics. Please remove them, they only increase the size of the API without adding any benefit.
- SolrMetricManager.unregisterGauges(String registryName, String tagSegment) should have some javadoc now, because it's not obvious what is a "tagSegment".
- SolrMetricProducer.getUniqueMetricTag has a misleading javadoc comment - it does not produce a metric name but a tag for tying together objects of the same life-cycle.
- Additionally, if parentName.contains(name) should be an error - I don't see when it would be a valid occurrence?
- SolrMetricProducer.initializeMetrics(SolrMetrics) has this peculiar message in the exception: ""This means , the class has not implemented both of these methods". We can do so much better here...
- the separator for tag hierarchy is ":" even though the javadoc claims it's "A.B.C". FWIW I think ":" is a better choice, so just the javadoc needs to be fixed.
- SolrMetricProducer.getMetrics() should not have a default implementation on master, only on branch_8x. Also, the name of the method should reflect the name of the object it returns, whatever we end up with.
- similarly, SolrMetricProducer.initializeMetrics(SolrMetricManager ...) should be deprecated on branch_8x and removed from master.
- PluginBag.put(String, T) should not take care of closing old instances of plugins - instead the corresponding PluginBag.put(String, PluginHolder<T>) (which is also called directly from SolrCore) should do this, otherwise in some cases the old plugin will be closed and in some others it won't. This method should also check for object identity, if the code re-registers the same instance of plugin holder.
- RequestHandlerBase.getMetricRegistry will throw an NPE when called before initializeMetrics is called.
- why does SuggestComponent still call metricsInfo.metricManager.registerGauge instead of metricsInfo.gauge ? BTW, the name of this field should be changed to reflect the name of the SolrMetrics class.
- FastLRUCache.close() and the same method in other caches should simply call SolrMetricProducer.super.close() instead of replicating the logic from that method.
- there are no unit tests for the new functionality of managing gauges with hierarchical metric tags.
- the new field SolrMetrics metrics should be declared together with the declarations of other fields.
- I see that in general many SolrMetricProducer implementations on master still use the old API - they should be converted to the new API if we're serious about using it and deprecating the old one. This can be done in a separate issue.
noble.paul - please revert these changes until the above problems are addressed. I'll be happy to help you fix them - let's create a branch.
let's make the changes/reverts in https://github.com/apache/lucene-solr/tree/jira/SOLR-13677_3.
I marked this as blocker for 8.3 to make sure the changes already committed are fixed before the release.
Reverts should take place right away, else we can easily end up in a bad situation. Development should be worked out on the branch, not in our release branches.
Please let's do a simple and fast revert and then commit when we have consensus.
Yes, lets revert this now then. I appreciate that ab is setting a good bar for quality software!
My plan of action
- I'll revert all the changes I have made as a part of this issue in the branch SOLR-13677_3.
- This will also mean I'll "Ignore" the tests which are failing due to this
- I'll leave it there for a couple of days for you to take any further action. I mean fix the original issue
- After that I'll merge those changes back into master & branch_8x
noble.paul, ab Instead of this multi-day revert plan (which meet seem confusing to understand), I recommend an immediate revert in master and branch_8x.
Yes, I'm planning to do this ASAP.
Why should I be the only one bothered about a memory leak
Commit 042478cfa795dd537dcd4863a0524a73bad9a740 in lucene-solr's branch refs/heads/jira/SOLR-13677_3 from Noble Paul
[ https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=042478c ]
SOLR-13677: reverting the last commit
As planned, I have committed the revert to SOLR-13677_3.
I'll give it a few hours. run some testing and merge this to master / branch_8x
Commit 042478cfa795dd537dcd4863a0524a73bad9a740 in lucene-solr's branch refs/heads/master from Noble Paul
[ https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=042478c ]
SOLR-13677: reverting the last commit
Commit a288710a64acdde6abc8ce96a0d3b3e18739ac32 in lucene-solr's branch refs/heads/master from Noble Paul
[ https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=a288710 ]
SOLR-13677: reverting the last commit (#863)
Commit b1bccf7cace424cb895ca6d05b30926697bfe86b in lucene-solr's branch refs/heads/branch_8x from Noble Paul
[ https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=b1bccf7 ]
SOLR-13677: reverting the last commit
This is a blocker for 8.3, but is still unassigned. ab, based on the private conversations, should I assign it to you?
(Or, noble.paul, should I assign it back to you, since you were attempting the fix?)
How much does it leak in 8.2? Does leak happens under certain condition or action? Is 8.2 prod ready?
ab, assigned to you for further action/guidance. If you think you'll not be able to get to this by 8.3, can I commit Noble's fix?
mkhl no. the leak happens only if you unload a plugin w/o unloading the core. With hot reloading of plugins, it'll be a very common usecase
ichattopadhyaya my objections to the last patch from Noble are still valid, so I would object to committing the same unchanged patch again. And I don't think we should start from scratch - we should just fix the problems in Noble's patch that I listed in my review.
I hope to be able to look at this tomorrow. If other stuff intervenes then I won't be able to work on this until Monday next week.
I hope to be able to look at this tomorrow. If other stuff intervenes then I won't be able to work on this until Monday next week.
Hi ab, could you please get a chance to look at this?
I wasn't sure what was the state of various branches so I created a new one, with the latest patch from Noble: jira/solr-13677-final and I started fixing the issues from review.
This is an updated patch, after cleanup of the issues listed in the review, and with some additional changes:
- I excluded the scope from the context, because in most cases we can reuse the parent context for components with different scopes.
- I converted some internal, non-pluggable components to use the new API.
This still is a large change and needs more testing - I'll need another day to be reasonably sure that it doesn't break things.
Commit f07998fc234c81ff956a84ee508b85f8d573ef38 in lucene-solr's branch refs/heads/master from Andrzej Bialecki
[ https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=f07998f ]
SOLR-13677: All Metrics Gauges should be unregistered by components that registered them.
Commit 441af3e7aa5266e7929c93cb5f3abe0746e5c054 in lucene-solr's branch refs/heads/branch_8x from Andrzej Bialecki
[ https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=441af3e ]
SOLR-13677: All Metrics Gauges should be unregistered by components that registered them.
Commit d796eca84dbabe3ae9b3c27afc01ef3bee35acb1 in lucene-solr's branch refs/heads/branch_8_3 from Andrzej Bialecki
[ https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=d796eca ]
SOLR-13677: All Metrics Gauges should be unregistered by components that registered them.
Commit b029de191ebdfd50b81f29e3334ccaac46bd70da in lucene-solr's branch refs/heads/branch_8x from Andrzej Bialecki
[ https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=b029de1 ]
SOLR-13677: Add a missing override, which resulted in missing metrics (reported by tflobbe).
Commit 2aa586909b911e66e1d8863aa89f173d69f86cd2 in lucene-solr's branch refs/heads/branch_8_3 from Andrzej Bialecki
[ https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=2aa5869 ]
SOLR-13677: Add a missing override, which resulted in missing metrics (reported by tflobbe).
Commit 2aa586909b911e66e1d8863aa89f173d69f86cd2 in lucene-solr's branch refs/heads/jira/SOLR-13101 from Andrzej Bialecki
[ https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=2aa5869 ]
SOLR-13677: Add a missing override, which resulted in missing metrics (reported by tflobbe).
Commit 3ce75aac49c79a023a9f1519badfe769e6a8f797 in lucene-solr's branch refs/heads/jira/
SOLR-13677from Noble Paul[ https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=3ce75aa ]
SOLR-13677: initial commit