[SOLR-7571] Return metrics with update requests to allow clients to self-throttle - ASF JIRA

Details

Type: Improvement
Status: Resolved
Priority: Major
Resolution: Duplicate
Affects Version/s: 4.10.3
Fix Version/s: None
Component/s: None
Labels: None

Description

I've assigned this to myself to keep track of it; anyone who wants it, please feel free to take it.

I've recently seen a setup with 10 shards and 4 replicas. The SolrJ client (and post.jar for JSON files, for that matter) firehoses updates at Solr from 150 separate threads in total. Eventually, replicas (not leaders) go into recovery, the state cascades, and the entire cluster becomes unusable. SOLR-5850 delays the behavior, but it still occurs. There are no errors in the follower logs; this is leader-initiated recovery caused by a timeout.

I think the root problem is that the client is simply sending too many requests to the cluster and, in turn, to ConcurrentUpdateSolrClient/Server (used by the leader to distribute update requests to all the followers). This was observed in Solr 4.10.3+. I see thread counts of 500+ when this happens.

So, assuming that this is the root cause, the obvious "cure" is "don't index that fast". This is unsatisfactory: since "that fast" is variable, the only recourse is to set the threshold so low that the Solr cluster isn't being driven as fast as it can be.

We should provide some mechanism for having the client throttle itself. Returning the number of outstanding update threads with the update response is one possibility. The client could then slow down sending updates to Solr.
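To make the idea concrete, here is a minimal sketch of what the client side might look like. It assumes the update response carried an outstanding-update-thread count, which Solr does not return today. The class name UpdateThrottle, the target of 100 threads, and the delay values are all made up for illustration; this is not an existing SolrJ API.

```java
import java.util.concurrent.TimeUnit;

/**
 * Hypothetical client-side throttle (not part of SolrJ).
 * Assumes each update response reports how many update threads are
 * outstanding on the server; that metric is only a proposal here.
 */
final class UpdateThrottle {
    private final int targetThreads; // assumed "comfortable" server load
    private final long maxDelayMs;   // upper bound on the pause between updates
    private long delayMs = 0;        // current pause, 0 means full speed

    UpdateThrottle(int targetThreads, long maxDelayMs) {
        this.targetThreads = targetThreads;
        this.maxDelayMs = maxDelayMs;
    }

    /** Feed in the thread count reported with the latest response; returns the new delay. */
    synchronized long onResponse(int outstandingThreads) {
        if (outstandingThreads > targetThreads) {
            // Server is overloaded: back off, doubling the delay (10 ms minimum).
            delayMs = Math.min(maxDelayMs, Math.max(10, delayMs * 2));
        } else {
            // Server has headroom: halve the delay until it reaches 0 again.
            delayMs = delayMs / 2;
        }
        return delayMs;
    }

    /** Call before sending the next update batch. */
    void pause() throws InterruptedException {
        long d;
        synchronized (this) {
            d = delayMs;
        }
        if (d > 0) {
            TimeUnit.MILLISECONDS.sleep(d);
        }
    }
}
```

Each indexing thread would call pause() before sending a batch and onResponse(...) after the response comes back. With a target of 100, a server reporting 500 threads would push the delay to 10 ms, then 20 ms, then 40 ms, and the delay would decay back toward zero once the reported count drops below the target. Doubling and halving is just one possible policy; the point is only that the client adapts to what the server reports instead of using a fixed rate.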

I'm not sure there's a good way to deal with this on the server. Once the timeout is encountered, you don't know whether the doc has actually been indexed on the follower (actually, in this case it is indexed, it just takes a while). Ideally we'd just manage it all magically, but an alternative that lets clients dynamically throttle themselves seems doable.

Issue Links

is related to

SOLR-7572 hard commits with waitSearcher=true occasionally returns without waiting leading to inconsistent views of the index.

Resolved

SOLR-7573 Inconsistent numbers of docs between leader and replica

Resolved

SOLR-7344 Allow Jetty thread pool limits while still avoiding distributed deadlock.

Resolved

relates to

SOLR-5850 Race condition in ConcurrentUpdateSolrServer

Open

SOLR-7572 hard commits with waitSearcher=true occasionally returns without waiting leading to inconsistent views of the index.

Resolved

SOLR-7573 Inconsistent numbers of docs between leader and replica

Resolved

People

Assignee: Erick Erickson

Reporter: Erick Erickson

Votes: 0

Watchers: 5

Dates

Created: 19/May/15 19:20

Updated: 20/Jan/16 22:05

Resolved: 20/Jan/16 22:05