[IGNITE-10808] Discovery message queue may build up with TcpDiscoveryMetricsUpdateMessage - ASF JIRA


Details

Type: Bug
Status: Resolved
Priority: Major
Resolution: Fixed
Affects Version/s: 2.7
Fix Version/s: 2.8
Component/s: None
Labels:
- discovery

Ignite Flags:

Docs Required

Description

A node receives a new metrics update message every `metricsUpdateFrequency` milliseconds, and each one is put at the top of the queue because it is a high-priority message.
If processing one message takes longer than `metricsUpdateFrequency`, multiple `TcpDiscoveryMetricsUpdateMessage` instances accumulate in the queue. A long enough delay (e.g. caused by a network glitch or a GC pause) may lead to the queue building up tens of metrics update messages, which are essentially useless to process. Finally, if processing a message takes on average slightly longer than `metricsUpdateFrequency` (even for a relatively short period, say a minute of network issues), the message worker ends up processing only metrics updates and the cluster essentially hangs.
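
The build-up described above can be illustrated with a rough model. This is only a sketch under simplifying assumptions: it considers metrics update messages alone, a single worker, and fixed timings. The class name, method name, and the numbers used (a 2000 ms update frequency and 2500 ms of processing per message) are hypothetical and are not taken from Ignite.

```java
public class MetricsBacklogSketch {
    /**
     * Counts the metrics update messages that have arrived but are not yet fully
     * processed after durationMs, assuming one message arrives every updateFreqMs
     * and a single worker needs processMs for each of them.
     */
    public static int unfinishedAfter(long durationMs, long updateFreqMs, long processMs) {
        long workerFreeAt = 0; // time at which the worker finishes its current message
        int unfinished = 0;

        for (long arrival = updateFreqMs; arrival <= durationMs; arrival += updateFreqMs) {
            // The worker starts this message once it is free and the message has arrived.
            long start = Math.max(workerFreeAt, arrival);
            workerFreeAt = start + processMs;

            if (workerFreeAt > durationMs)
                unfinished++;
        }

        return unfinished;
    }

    public static void main(String[] args) {
        // Hypothetical timings: one minute of slightly-too-slow processing.
        System.out.println(unfinishedAfter(60_000, 2_000, 2_500));
    }
}
```

In this model the backlog grows for as long as processing stays slower than the update frequency, and because every new metrics update goes to the top of the queue, regular messages behind it are not reached.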

A reproducer is attached. In the test, the queue first builds up and is then drained very slowly, causing "Failed to wait for PME" messages.

ServerImpl's SocketReader needs to be changed so that it does not put another metrics update message at the top of the queue if one is already there (or so that it replaces the one at the top with the new one).
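
The "replace the one at the top" variant could look roughly like the sketch below. It is not the actual ServerImpl code or the change made in the linked pull request: `CoalescingQueueSketch`, `Msg`, and the method names are hypothetical stand-ins used only to show the idea of coalescing metrics updates at the top (head) of a deque.

```java
import java.util.ArrayDeque;
import java.util.Deque;

public class CoalescingQueueSketch {
    /** Hypothetical stand-in for a discovery message; not Ignite's real class hierarchy. */
    public static final class Msg {
        public final String type;
        public final long id;

        public Msg(String type, long id) {
            this.type = type;
            this.id = id;
        }

        public boolean isMetricsUpdate() {
            return "METRICS_UPDATE".equals(type);
        }
    }

    private final Deque<Msg> queue = new ArrayDeque<>();

    /** Regular messages are appended to the tail of the queue. */
    public synchronized void addRegular(Msg msg) {
        queue.addLast(msg);
    }

    /**
     * High-priority messages go to the top of the queue. If the new message is a
     * metrics update and the current top is one too, the stale one is dropped first,
     * so at most one metrics update ever sits at the top.
     */
    public synchronized void addHighPriority(Msg msg) {
        Msg head = queue.peekFirst();

        if (msg.isMetricsUpdate() && head != null && head.isMetricsUpdate())
            queue.pollFirst();

        queue.addFirst(msg);
    }

    public synchronized Msg poll() {
        return queue.pollFirst();
    }

    public synchronized int size() {
        return queue.size();
    }

    public static void main(String[] args) {
        CoalescingQueueSketch q = new CoalescingQueueSketch();
        q.addRegular(new Msg("CUSTOM", 1));

        for (long id = 2; id <= 4; id++)
            q.addHighPriority(new Msg("METRICS_UPDATE", id));

        System.out.println(q.size()); // one metrics update plus the regular message
    }
}
```

With this approach a delayed worker sees at most one pending metrics update, so regular messages are reached after a single extra step instead of after the whole backlog.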

Attachments


IgniteMetricsOverflowTest.java
24/Dec/18 17:06
4 kB
Stanislav Lukyanov

Issue Links

links to

GitHub Pull Request #5771

Activity

People

Assignee: Denis Mekhanikov

Reporter: Stanislav Lukyanov

Reviewer: Sergey Chugunov

Votes: 0

Watchers: 7

Dates

Created: 24/Dec/18 17:07

Updated: 23/Aug/19 10:23

Resolved: 23/Aug/19 10:23

Time Tracking

Estimated:

Not Specified

Remaining:

Logged:

10m