[SOLR-9824] Documents indexed in bulk are replicated using too many HTTP requests - ASF JIRA

Details

Type: Improvement
Status: Resolved
Priority: Major
Resolution: Fixed
Affects Version/s: 6.3
Fix Version/s: 7.0
Component/s: SolrCloud
Labels: None

Description

This takes a while to explain; bear with me. While working on bulk indexing of small documents, I looked at the logs of my SolrCloud nodes. I noticed that shards would log an /update message every ~6ms, which is far too often. These are requests from one shard (which isn't a leader/replica for these docs, just the recipient from my client) to the target shard leader (no additional replicas). One might ask why I'm not sending docs to the right shard in the first place; I have a reason but it's beside the point – there's a real Solr perf problem here, and it probably applies equally to replicationFactor>1 situations too. I could turn off the logs but that would hide useful stuff, and it's disconcerting to me that so many short-lived HTTP requests are happening, somehow at the behest of DistributedUpdateProcessor. After lots of analysis, debugging, and hair pulling, I finally figured it out.

SOLR-7333 (tpot) introduced an optimization called UpdateRequest.isLastDocInBatch(): when the flag is set, ConcurrentUpdateSolrClient polls its internal queue with a '0' timeout, so that it can close the connection without it hanging around any longer than needed. This part makes sense to me. Currently the only spot that has the smarts to set this flag is JavaBinUpdateRequestCodec.unmarshal.readOuterMostDocIterator(), at the last document. So if a shard received docs in a javabin stream (but not other formats), one would expect the last document to have this flag. There's even a test. Docs without this flag get the default poll time; for javabin it's 25ms. Okay.
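The timeout choice described above can be sketched in plain Java. This is only a rough model, not Solr's code: `PollTimeoutSketch`, `Update`, `pollTimeoutMs` and `nextOrNull` are hypothetical names, and the 25 ms default is taken from the description rather than read out of the source.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;

/** Hypothetical model of the poll-timeout choice; not Solr's actual code. */
class PollTimeoutSketch {
    /** Assumed default wait for the next queued doc (25 ms, per the description). */
    static final int DEFAULT_POLL_MS = 25;

    /** Stand-in for a queued update that carries the last-doc-in-batch flag. */
    static final class Update {
        final String docId;
        final boolean lastDocInBatch;

        Update(String docId, boolean lastDocInBatch) {
            this.docId = docId;
            this.lastDocInBatch = lastDocInBatch;
        }
    }

    /** After a doc flagged as last, do not wait at all; otherwise wait the default. */
    static int pollTimeoutMs(Update justSent) {
        return justSent.lastDocInBatch ? 0 : DEFAULT_POLL_MS;
    }

    /**
     * Returns the next queued update, or null if none shows up within the
     * chosen timeout. In this model, null means "close the open request".
     */
    static Update nextOrNull(BlockingQueue<Update> queue, Update justSent) {
        try {
            return queue.poll(pollTimeoutMs(justSent), TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt status
            return null;
        }
    }

    public static void main(String[] args) {
        BlockingQueue<Update> queue = new LinkedBlockingQueue<>();
        Update last = new Update("doc-2", true);
        // Queue is empty and the doc was flagged as last: give up immediately.
        System.out.println("close request: " + (nextOrNull(queue, last) == null));
    }
}
```

In this model a flagged document makes the sender stop waiting as soon as the queue is momentarily empty, which is the intended behavior at the true end of a batch.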

I suspect that if someone used CloudSolrClient or HttpSolrClient to send javabin data in a batch, the intended efficiencies of SOLR-7333 would apply. I didn't try. In my case, I'm using ConcurrentUpdateSolrClient (CUSC), and BTW DistributedUpdateProcessor uses CUSC too. CUSC uses the RequestWriter (defaulting to javabin) to send each document separately, without any leading or trailing marker. For the XML format, by comparison, there is a leading and trailing marker (<stream> ... </stream>). Since there's no outer container from which the javabin unmarshalling could detect the last document, it marks every document as req.lastDocInBatch()! With a '0' poll timeout after every document, the connection gets closed almost every time, which explains the flood of short-lived /update requests. Ouch!
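To illustrate why flagging every document hurts, here is a small deterministic model that counts how many requests get opened. It is a sketch under stated assumptions, not a measurement of Solr: `RequestCountSketch` and `countRequests` are hypothetical names, send time is ignored, and the 6 ms arrival gap is an assumption chosen only to echo the log interval mentioned above.

```java
import java.util.Arrays;

/** Hypothetical model: how many requests are opened for a stream of docs. */
class RequestCountSketch {
    /**
     * arrivalMs[i] is when doc i reaches the sender's queue (virtual clock).
     * After sending doc i, the sender waits 0 ms if the doc is flagged as
     * last in batch, otherwise defaultPollMs. If the next doc arrives within
     * that wait, it reuses the open request; otherwise a new one is opened.
     * Simplification: the time spent sending a doc is treated as zero.
     */
    static int countRequests(long[] arrivalMs, boolean[] lastDocInBatch, long defaultPollMs) {
        int requests = 0;
        boolean open = false;
        for (int i = 0; i < arrivalMs.length; i++) {
            if (!open) {
                requests++; // no open request to reuse, so open a new one
            }
            open = false;
            if (i + 1 < arrivalMs.length) {
                long gap = arrivalMs[i + 1] - arrivalMs[i];
                long timeout = lastDocInBatch[i] ? 0 : defaultPollMs;
                open = gap <= timeout; // next doc arrived before the poll gave up
            }
        }
        return requests;
    }

    public static void main(String[] args) {
        int n = 1000;
        long[] arrivals = new long[n];
        for (int i = 0; i < n; i++) {
            arrivals[i] = i * 6L; // assumed gap between docs, for illustration only
        }

        boolean[] everyDocFlagged = new boolean[n];
        Arrays.fill(everyDocFlagged, true); // the behavior described in this issue

        boolean[] onlyFinalFlagged = new boolean[n];
        onlyFinalFlagged[n - 1] = true; // the intent of SOLR-7333

        System.out.println("every doc flagged:  " + countRequests(arrivals, everyDocFlagged, 25));
        System.out.println("only final flagged: " + countRequests(arrivals, onlyFinalFlagged, 25));
    }
}
```

Under these assumptions the first case opens one request per document, while the second keeps a single request open for the whole stream because each 6 ms gap fits inside the 25 ms poll.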

Attachments

- SOLR-9824-tflobbe.patch, 16/May/17 20:27, 2 kB, Tomas Eduardo Fernandez Lobbe
- SOLR-9824.patch, 28/Dec/16 13:01, 33 kB, Mark Miller
- SOLR-9824.patch, 12/Dec/16 01:09, 33 kB, Mark Miller
- SOLR-9824.patch, 11/Dec/16 20:06, 27 kB, Mark Miller
- SOLR-9824.patch, 11/Dec/16 14:33, 25 kB, Mark Miller
- SOLR-9824.patch, 09/Dec/16 04:50, 20 kB, Mark Miller
- SOLR-9824.patch, 09/Dec/16 02:39, 25 kB, Mark Miller
- SOLR-9824.patch, 09/Dec/16 00:58, 24 kB, Mark Miller

Issue Links

relates to: SOLR-7333 Make the poll queue time configurable and use knowledge that a batch is being processed to poll efficiently (Closed)

People

Assignee: Mark Miller

Reporter: David Smiley

Votes: 0

Watchers: 6

Dates

Created: 03/Dec/16 00:41

Updated: 08/Jun/19 15:31

Resolved: 08/Aug/17 03:06