Details
-
Bug
-
Status: Closed
-
Major
-
Resolution: Fixed
-
8.7
Description
The config range, 50% to 95%, assumes that the circuit breaker is triggered by a CPU utilization metric that goes from 0% to 100%. But the code uses the metric OperatingSystemMXBean.getSystemLoadAverage(). That is an average of the count of processes waiting to run. It is effectively unbounded. I've seen it as high as 50 to 100. It is not bound by 1.0 (100%).
A good limit for load average would need to be aware of the number of CPUs available to the JVM. A load average of 8 is no problem for a 32 CPU host. It is a critical situation for a 2 CPU host.
Also, load average is a Unix OS metric. I don't know if it is even available on Windows.
Instead, use a CPU utilization metric that goes from 0.0 to 1.0. A good choice is OperatingSystemMXBean.getSystemCPULoad(). This name also uses "load", but it is a usage metric.
From the Javadoc:
> Returns the "recent cpu usage" for the whole system. This value is a double in the [0.0,1.0] interval. A value of 0.0 means that all CPUs were idle during the recent period of time observed, while a value of 1.0 means that all CPUs were actively running 100% of the time during the recent period being observed. All values betweens 0.0 and 1.0 are possible depending of the activities going on in the system. If the system recent cpu usage is not available, the method returns a negative value.
Also update the documentation to explain which JMX metrics are used for the memory and CPU circuit breakers.
Attachments
Attachments
- 0001-SOLR-15056-Circuit-Breakers-use-CPU-utilization-inst.patch
- 17 kB
- Walter Underwood
- 0002-SOLR-15056-clean-up-linkage-to-SolrCore-add-back-loa.patch
- 31 kB
- Walter Underwood
- SOLR-15056.patch
- 17 kB
- Walter Underwood
Issue Links
- is related to
-
SOLR-14615 CPU Based Circuit Breaker
- Resolved
- links to
Activity
Looking at the code, it should be ok to test for it and attempt to use it if we have it, but I don't think we can have any guarantees.
Also, I think precommit will complain about com.sun classes, and I'm not sure how to make it ignore that, maybe a SupressForbidden annotation.
The factory will return null if that MXBean isn't available, and I added a null check.
Maybe I should take the current implementation and make it a LoadAverageCircuitBreaker in addition to this. That metric is always available, oddly. That seems like it would be far more Unix-specific than CPU utilization.
systemCpuLoad is already supported and returned as one of the metrics. This comes from the (somewhat convoluted) code in MetricUtils.addMxBeanMetrics where it tries to use all known implementations and accumulates any unique bean properties that they expose.
For example:
http://localhost:8983/solr/admin/metrics?group=jvm&prefix=os { "responseHeader": { "status": 0, "QTime": 1 }, "metrics": { "solr.jvm": { "os.arch": "x86_64", "os.availableProcessors": 12, "os.committedVirtualMemorySize": 8402419712, "os.freePhysicalMemorySize": 41504768, "os.freeSwapSpaceSize": 804519936, "os.maxFileDescriptorCount": 8192, "os.name": "Mac OS X", "os.openFileDescriptorCount": 195, "os.processCpuLoad": 0.0017402379609634876, "os.processCpuTime": 10492010000, "os.systemCpuLoad": 0.1268950796343933, "os.systemLoadAverage": 4.00439453125, "os.totalPhysicalMemorySize": 34359738368, "os.totalSwapSpaceSize": 7516192768, "os.version": "10.16" } } }
The equivalent Java code would be something like this:
coreContainer .getSolrMetricManager() .registry(“solr.jvm”) .getMetrics() .get(“os.systemCpuLoad”)
Mostly passes "./gradlew check -Pvalidation.git.failOnModified=false" on master. Some probably unrelated cloud tests fail.
I've added a patch.
Next, I'm setting up some functional testing. I'll fire up a server, configure a circuit breaker, then start using up system CPU with "yes | wc &". That will close to max-out two processors. At some point, searches should start returning 503.
This change is not needed and it makes the API ugly and more difficult to unit test:
- public boolean checkAnyTripped() { + public boolean checkAnyTripped(SolrCore core) {
The CircuitBreakerManager is created in SolrCore so its life-cycle is the same as SolrCore. This API misleadingly suggests that you could use it with any arbitrary SolrCore.
Instead, the manager can already hold onto a reference to SolrCore when its created, and use it to construct specific breaker implementations - in this case, you could pass SolrCore to the constructor of CPUCircuitBreaker, or make all classes that need it implement SolrCoreAware and initialize them in the manager's ctor in a uniform fashion.
Hi Walter,
Thanks for tackling this and apologies for the delay in response.
I took a look at the patch and here are my comments:
1. Majority of the changes in the patch are to make CircuitBreaker aware of the SolrCore that is being used – which is a cyclic dependency since SolrCore owns CircuitBreakerManager which in turn owns CircuitBreakers. Also, the API becomes ugly, so please fix this change.
2. If you look at the original discussion around this metric, the com.sun package wasn't used due to it being a specific implementation. I agree that it is a cleaner metric to use, but we cannot be depending on something that isn't available on certain VMs.
3. I am curious to understand your assertion about the 50-95% range – that is specifically for JVM heap usage based circuit breaker, not the CPU circuit breaker. Maybe the documentation needs to be clarified?
4. OperatingSystemMXBean.getSystemLoadAverage() is a good metric because it takes the number of CPUs and the current process queue length into account, averaging it by time. I would want to keep this metric unless it does not work for a specific OS (Windows, like you said). Maybe we can use OperatingSystemMXBean.getSystemCPULoad() as the fallback mechanism if the first one returns -1?
Also, I see a significant number of documentation changes which seem to be unrelated? e.g. "Configuration for JVM heap usage based circuit breaker:" to " Configuration for JVM heap usage circuit breaker:"
Thanks, I'll incorporate these changes about core. I'm still learning the internals.
I simplified the wording throughout the documentation. This should make it easier to read for non-native speakers of English.
atri 2. I understand the metric is not available on a few JVMs. It is available in free JVMs like Amazon Corretto. If someone is running a cluster under heavy load, it is probably worth switching to one of those JVMs.
atri 3. My comment about 50-95% in the original report was based on a misreading of the documentation, sorry.
atri 4. getSystemLoadAverage() is not normalized to the number of processors. It is an unbounded number. From the documentation: "The system load average is the sum of the number of runnable entities queued to the available processors and the number of runnable entities running on the available processors averaged over a period of time." When I vertically scale an AWS instance to one with more processors, I would also need to update all of the circuit breaker configs!
Load average has been an unbounded number for as long as I've been using Unix, forty years.
In my experience, load average only gets large after the system has hit 100% CPU. It is not very useful for predicting overload. I'm glad to keep it, but it needs a more accurate name and description.
...there's always this: if you don't know what to choose then make it an option
Seriously, though - Walter is correct that the regular sysLoadAvg is unbounded, it may reach eg. 10's or even 100's, so it's difficult to properly adjust the threshold.
getSystemCpuLoad is supported on Oracle/OpenJDK, Amazon Coretto and IBM J9, I'm not sure about Zulu but I would guess it's supported there too, so indeed it looks like a good default with predictable range. However, we could provide an option to use sysLoadAvg if that's what the user prefers (or if systemCpuLoad is not available) - it just needs a few if-else statements, but most of all it needs SOLID documentation.
Using load average needs a different threshold, so I'd make it a different circuit breaker instead of switching it out.
Copy the existing load average circuit breaker, use a more clear name, and figure out how to document that a limit of 1000 is really a limit of 10.0. Maybe even explain that load average is quite different on different OSes. Linux includes processes in iowait, other Unixes don't.
I'd rather use a threshold of 0.95 to mean 0.95 instead of using 95 to mean 0.95, but that is a breaking change.
getSystemCpuLoad is supported on Oracle/OpenJDK, Amazon Coretto and IBM J9, I'm not sure about Zulu but I would guess it's supported there too, so indeed it looks like a good default with predictable range. However, we could provide an option to use sysLoadAvg if that's what the user prefers (or if systemCpuLoad is not available) - it just needs a few if-else statements, but most of all it needs SOLID documentation.
I would agree to that – the main reason why we did not use systemCpuLoad in the first place was due to it being non standard – and I do not see that invariant changing now. A fallback mechanism is the right way IMO.
Using load average needs a different threshold, so I'd make it a different circuit breaker instead of switching it out.
I do not agree. Honestly, as long as the documentation is clear about what to expect, the current implementation IMHO provides a finer grained control to the user on what limits to expect and to set.
A separate circuit breaker for a different metric seems an over engineering effort and still does not solve the initial problem of the JVMs where this is not supported.
On our production clusters, we have alerts set at 65% CPU utilization. Looking at a recent alert, we had one of the 32 hosts reach 75% CPU. On that host, the load average reached a peak of 32.
75% is a threshold of "75". 32 is a threshold of "3200".
Right now, the CPU utilization is 5% and the load average is 3. So these are very different metrics. Silently replacing one with the other is a violation of the Principle of Least Surprise.
We already have working code for a load average circuit breaker. Make the name more accurate and use it.
There is no way to solve the problem of providing CPU utilization on a JVM that doesn't report CPU utilization.
CPU utilization and load average (run queue length) are complimentary metrics. CPU utilization measures how much work the CPUs are doing. The load average measures how much work is waiting for them (how much they are not doing).
Under 100% CPU, load average doesn't tell us much, but CPU usage is very useful. Over 100% CPU, CPU utilization doesn't tell us much, but load average tells us a lot. It tells us how much work is waiting to run.
Load average spikes after the CPU is very busy, something like 90%. When load average rises, Solr will already be overloaded and service will already be slowing down.
If Solr is running on a system that includes iowait in the load average, then it could be useful to have circuit breakers on both the CPU usage and the load average. The latter would tell when the storage or network is overloaded.
I'm assuming that the hosts are configured with enough RAM so that normal searching doesn't hit disk. That makes Solr CPU-limited. All of our 100+ Solr systems are configured that way.
Under 100% CPU, load average doesn't tell us much, but CPU usage is very useful. Over 100% CPU, CPU utilization doesn't tell us much, but load average tells us a lot. It tells us how much work is waiting to run.
Well said! Indeed these are two very different metrics, and having two separate breaker implementations is not a bad idea (well, we could use one impl + a switch, but that could be too confusing and too easy to make mistakes).
For a load average threshold of 3.0, should the config value be 300 or 3.0?
I'm going with 3.0 for now.
git made two patch files, one per commit. I'm glad to make a single one, but I don't know that git incantation.
Contents of latest patch:
- Moved SolrCore reference to CircuitBreakerManager
- Added back in existing CPU circuit breaker as LoadAverageCircuitBreaker
- Copy-editing on docs: replace "utilization" with "usage", make introductory paragraph more specific, document load average circuit breaker
- Unit tests updated to cover all three circuit breakers
wunder can you please create a pull request? this is a sizeable patch and it would make it easier for someone to review / discuss the changes.
wunder I am looking at this patch again and will be able to look deeper once the PR is available but had a few thoughts:
1. I am not sure what is the bugfix that this patch contains (referring to your email) – can you please elaborate on that?
2. Same for unit tests – did I miss the new tests that you mentioned in one of the patch versions?
3. I am not sure of the documentation changes – I think ctargett 's opinion is the best path here. Can you please split the documentation changes into a different patch so that we can review independently?
1. The bug is that the CPU utilization circuit breaker is not based on CPU utilization. It is based on load average. I've changed that to use CPU utilization and added a new new on that is based on load average.
2. There are now three circuit breakers, all have unit tests. The load average test was moved to the new load average circuit breaker and a new CPU usage test was created for the existing CPU usage breaker.
3. The documentation changes can't really be separated because they match the three circuit breakers in the code. Part of the bug was that the code did not match the documentation. The underlying JMX metrics are now named and linked. Other behavior is called out, for example that the circuit breaker only stops search requests, not update or admin requests. The other changes are simplifying grammar and wording, things like using "usage" instead of "utilization" consistently.
I am sorry, but I have to disagree on some aspects.
- The current circuit breaker publishes what it uses and if the documentation is not appropriate, we should fix it.
- I still do not see what is it that has been fixed in the existing tests – there is a new circuit breaker implementation, there are tests for that (thank you!). Beyond that, does this patch fix anything in existing tests that I am missing?
- I am fine with the elaboration (JMX, search requests) but I am not sure about the grammar and wording, hence I would want Cassandra's opinion there.
In entirety, I agree that the documentation should be extended to highlight more of JMX metrics that we use. The way the framework is built, it is fairly straightforward to implement a new circuit breaker. Hence, I think this patch should be focused on two things: improving documentation and add the new circuit breaker, and the messaging around this Jira should reflect that.
The documentation being part of the patch isn't a problem for me at all; as a project I feel we should prefer doc updates to be made with the code changes in the same patch or pull request. I can't always look at doc changes before they are committed, but whether I do it pre- or post-commit it's always helpful to have all the changes related to an issue together.
These changes look fine to me as long as they reflect the code changes. The grammar changes are improvements IMO, but I can only assume that they accurately reflect the code changes.
Atri, from your perspective, I think you'd want to verify the factual aspects of doc changes. For example, if the docs now say that the circuit breaker tracks the OperatingSystemMXBean.getSystemCpuLoad(), you'd want to check that is true based on the content of the patch. There's no way I have time to track down every doc change into the code, I expect the code reviewers to have done that (and IMO that's another reason why the doc changes being part of the patch/PR is helpful).
I have other minor nits about the doc changes - mainly that I try to put values for parameters in monospace instead of quotes as they are here, but I wouldn't let that hold up committing this if or when it happens to be ready.
Thanks ctargett!
Yes, I have verified the doc changes' accuracy as a part of my initial code review. The doc changes are split into two parts:
1. Updates to the docs for factual information (metrics, requests blocked etc) which are fine.
2. Grammatical changes aimed at simplifying the documentation (e.g. utilisation - usage).
To be clear, it is 2. that I need your help on, since I do not have an opinion on whether those kind of changes make the impact that they target to or not. I was requesting a separate doc patch specifically for changes of type 2, since they are numerous and warrant an independent review IMHO.
ctargett I'm glad to fix this to follow a standard style.
The current documentation does not say which JMX metrics are used by each circuit breaker. I added that documentation.
Some changes document what I found in the code, for example the current documentation says:
If circuit breakers are enabled, requests may be rejected under the condition of high node duress with an appropriate HTTP error code (typically 503).
It is up to the client to handle this error and potentially build a retrial logic as this should ideally be a transient situation.
The updated doc says:
Circuit breakers only interrupt search requests (`SearchHandler`). They are not checked for update requests, admin requests, etc. They are checked for distributed search requests, so they may result in partial failures for multi-shard requests. Circuit breakers are checked once, early in the request evaluation, before significant work is done. Long-running requests will not be interrupted.
Servers rejecting traffic with a 503 code may mislead a load balancer into thinking that they are broken, when they are actually intelligently handling overload. This could cause Solr hosts to be dropped from a load balancer, causing a cascading overload on the remaining hosts. Make sure the load balancer is configured to allow the servers to shed excess load with 503 responses.
There are a number of minor changes, mostly simplifying sentence structure or vocabulary. I've learned to do this so documents are more accessible to non-native English speakers. For example, "condition of high node duress" is replaced with "high node load".
The Wikipedia link is a broken URL, that is also fixed.
Hi. I chatted with Atri and he's OK with me picking up this JIRA. I'll try to dive into PRs and get things based on current main.
Commit 51c1a785c4611d0103f7b73c8adefa028d608bcd in solr's branch refs/heads/main from wrunderwood
[ https://gitbox.apache.org/repos/asf?p=solr.git;h=51c1a785c46 ]
SOLR-15056: add circuit breaker for CPU, fix load circuit breaker (#96)
Co-authored-by: Jan Høydahl <janhoy@users.noreply.github.com>
Commit a1d0938bbe5ca31a1cd2906d912c0a67952de040 in solr's branch refs/heads/branch_9x from wrunderwood
[ https://gitbox.apache.org/repos/asf?p=solr.git;h=a1d0938bbe5 ]
SOLR-15056: add circuit breaker for CPU, fix load circuit breaker (#96)
Co-authored-by: Jan Høydahl <janhoy@users.noreply.github.com>
(cherry picked from commit 51c1a785c4611d0103f7b73c8adefa028d608bcd)
Just realized that I didn't add a line in CHANGES.txt for this. Should that be under Improvements?
Thanks, didn't see that when searching for "15056" earlier. Not sure why.
Is it OK to use com.sun.management.OperatingSystemMXBean instead of the java.lang.management version?
getSystemCpuLoad() is only in the former. The current code uses java.lang.management.OperatingSystemMXBean.