[SOLR-13975] ConcurrentUpdateSolrClient connection stall prevention - ASF JIRA

Details

Type: Bug
Status: Closed
Priority: Major
Resolution: Fixed
Affects Version/s: 8.3, 8.4
Fix Version/s: 8.4
Component/s: None
Labels: None

Description

When a Solr process that hosts replicas of a collection is suspended (that is, the OS process is stopped, e.g. with kill -STOP <pid>), a long stall may occur in ConcurrentUpdateSolrClient (CUSC) until a socket timeout is reached.

During this stall, updates from the leader are not forwarded to any replica, even though the other replicas are still active and can receive updates. If the sender also uses CUSC (e.g. via CloudSolrClient), it stalls as well, because the leader stops processing updates.

This situation is caused by several issues:

- When a process is suspended, its sockets remain open. There is no immediate disconnect, as there would be if the process had died, but the process becomes unresponsive. Eventually a socket timeout is reached (distribUpdateSoTimeout), but in the default version of solr.xml this is set to 10 minutes. During this time all indexing to that shard is stuck.
- There are several infinite for loops in CUSC (e.g. in blockUntilFinished, waitForEmptyQueue and even in request) that rely on the call either succeeding relatively quickly or throwing an exception. In this situation neither happens quickly: the call is stuck waiting for the remote end until soTimeout expires.

This issue proposes adding stall prevention logic that breaks out of these infinite loops long before the socket timeout occurs, based on whether queue processing is making progress.
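
The progress-based check described above could be sketched roughly as follows. This is an illustration only, not the code from the attached patch: the class StallGuard, its fields, the processed counter and the stallTimeMillis parameter are all hypothetical names. The idea is to remember when the count of processed requests last changed and to abort the wait loop once that count has stayed the same for longer than a configured stall time.

```java
import java.io.IOException;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;

/** Hypothetical helper (not the actual Solr patch): detects lack of queue progress. */
final class StallGuard {
    private final long stallTimeNanos;
    private long lastProcessed;      // processed count seen at the last progress
    private long lastProgressNanos;  // time at which that progress was observed

    public StallGuard(long stallTimeMillis, long processedSoFar, long nowNanos) {
        this.stallTimeNanos = TimeUnit.MILLISECONDS.toNanos(stallTimeMillis);
        this.lastProcessed = processedSoFar;
        this.lastProgressNanos = nowNanos;
    }

    /** Returns true if the processed count has not changed for longer than the stall time. */
    public boolean isStalled(long processedNow, long nowNanos) {
        if (processedNow != lastProcessed) {
            // Progress was made: remember it and restart the stall clock.
            lastProcessed = processedNow;
            lastProgressNanos = nowNanos;
            return false;
        }
        return nowNanos - lastProgressNanos > stallTimeNanos;
    }

    /**
     * Waits until the queue is empty, but gives up with an IOException when the
     * consumer side (assumed to increment 'processed') stops making progress.
     */
    public static void waitForEmptyQueue(BlockingQueue<?> queue, AtomicLong processed,
                                         long stallTimeMillis)
            throws IOException, InterruptedException {
        StallGuard guard = new StallGuard(stallTimeMillis, processed.get(), System.nanoTime());
        while (!queue.isEmpty()) {
            if (guard.isStalled(processed.get(), System.nanoTime())) {
                throw new IOException("No queue progress for more than " + stallTimeMillis
                        + " ms, " + queue.size() + " request(s) still pending");
            }
            Thread.sleep(10); // short poll interval instead of blocking indefinitely
        }
    }
}
```

In this sketch a slow but live replica never trips the guard, because every processed request resets the clock. Only a complete lack of progress for the whole stall period ends the wait, and that period can be set far below the 10 minute socket timeout.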

This is a follow-up to SOLR-13896.

Attachments

- SOLR-13975.patch, 03/Dec/19 17:29, 10 kB, Andrzej Bialecki
- SOLR-13975.patch, 04/Dec/19 20:52, 23 kB, Andrzej Bialecki

People

Assignee: Andrzej Bialecki

Reporter: Andrzej Bialecki

Votes: 0

Watchers: 7

Dates

Created: 27/Nov/19 16:58

Updated: 21/Jan/20 16:57

Resolved: 12/Dec/19 20:50