[SOLR-10277] On 'downnode', lots of wasteful mutations are done to ZK - ASF JIRA

Details

Type: Bug
Status: Closed
Priority: Major
Resolution: Fixed
Affects Version/s: 5.5.3, 5.5.4, 6.0.1, 6.2.1, 6.3, 6.4.2
Fix Version/s: 6.5.1, 7.0
Component/s: SolrCloud
Labels:
- leader
- zookeeper

Description

When a node restarts, it submits a single 'downnode' message to the overseer's state update queue.

When the overseer processes the message, it performs far more writes to ZK than necessary. In our cluster of 48 hosts, the majority of collections have only 1 shard and 1 replica, so a single node restarting should result in only ~1/40th of the collections being updated with new replica states (to indicate that the replicas on that node are no longer active).

However, the current logic in NodeMutator#downNode always updates every collection. As a result, we have to do rolling restarts very slowly to avoid a severe outage caused by the overseer doing far too much work for each restarted host. State updates for shards that subsequently become leader can't be processed until the `downnode` message is fully processed, so a fast rolling restart can leave the overseer queue growing extremely large and nearly all shards in a leader-less state until that backlog is processed.

The fix is a trivial logic change to only add a ZkWriteCommand for collections that actually have an impacted replica.
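
The logic change described above can be sketched roughly as follows. This is an illustration only, not the code from the attached patches: the class `DownNodeSketch`, the method `collectionsToWrite`, and the map-based cluster state (collection name to replica name to node name) are hypothetical stand-ins for Solr's real types, and each returned collection name stands in for one ZkWriteCommand.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical sketch; these types are NOT Solr's actual NodeMutator/ClusterState API.
final class DownNodeSketch {

    /**
     * clusterState maps collection name -> (replica name -> node name).
     * Returns only the collections that have at least one replica on downNode,
     * i.e. the only collections for which a write to ZK would be needed.
     */
    static List<String> collectionsToWrite(Map<String, Map<String, String>> clusterState,
                                           String downNode) {
        List<String> toWrite = new ArrayList<>();
        for (Map.Entry<String, Map<String, String>> collection : clusterState.entrySet()) {
            boolean impacted = false;
            for (String replicaNode : collection.getValue().values()) {
                if (replicaNode.equals(downNode)) {
                    impacted = true; // this replica would be marked down
                    break;
                }
            }
            // Before the fix, every collection was written regardless of this flag.
            if (impacted) {
                toWrite.add(collection.getKey());
            }
        }
        return toWrite;
    }
}
```

With mostly single-replica collections spread over many hosts, the returned list should be a small fraction of the collections rather than all of them.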

Attachments

- SOLR-10277.patch, 05/Apr/17 07:44, 36 kB, Shalin Shekhar Mangar
- SOLR-10277.patch, 30/Mar/17 06:00, 35 kB, Varun Thacker
- SOLR-10277-5.5.3.patch, 15/Mar/17 13:20, 10 kB, Joshua Humphries

Issue Links

is related to

- SOLR-7281 Add an overseer action to publish an entire node as 'down' (Closed)

relates to

- SOLR-10524 Better ZkStateWriter batching (Closed)

People

Assignee: Scott Blum

Reporter: Joshua Humphries

Votes: 3

Watchers: 10

Dates

Created: 13/Mar/17 21:55

Updated: 21/Nov/19 00:42

Resolved: 05/Apr/17 10:37