SOLR-3473

Distributed deduplication broken

    Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 4.0-ALPHA
    • Fix Version/s: 4.8
    • Component/s: SolrCloud, update
    • Labels: None

      Description

      Solr's deduplication via the SignatureUpdateProcessor is broken for distributed updates on SolrCloud.

      Mark Miller:

      Looking again at the SignatureUpdateProcessor code, I think that indeed this won't currently work with distrib updates. Could you file a JIRA issue for that? The problem is that we convert update commands into solr documents - and that can cause a loss of info if an update proc modifies the update command.

      I think the reason that you see a multiple values error when you try the other order is because of the lack of a document clone (the other issue I mentioned a few emails back). Addressing that won't solve your issue though - we have to come up with a way to propagate the currently lost info on the update command.

      Please see the ML thread for the full discussion: http://lucene.472066.n3.nabble.com/SolrCloud-deduplication-td3984657.html
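
      For context, the "currently lost info" is the deletion Term that the processor attaches to the in-JVM command when overwriteDupes=true. A paraphrased sketch of the 4.x processor (approximate, not the verbatim source; config plumbing omitted and the hex encoding simplified):

      import java.io.IOException;
      import org.apache.lucene.index.Term;
      import org.apache.solr.common.SolrInputDocument;
      import org.apache.solr.common.SolrInputField;
      import org.apache.solr.update.AddUpdateCommand;
      import org.apache.solr.update.processor.Lookup3Signature;
      import org.apache.solr.update.processor.Signature;
      import org.apache.solr.update.processor.UpdateRequestProcessor;

      // Paraphrased sketch of SignatureUpdateProcessor.processAdd (4.x): hash the
      // configured fields, store the hex signature in the signature field, and
      // (with overwriteDupes) record the pending deletion as cmd.updateTerm.
      class SignatureSketchProcessor extends UpdateRequestProcessor {
        private final String signatureField = "sig";            // from config
        private final String[] sigFields = {"name", "content"}; // from config
        private final boolean overwriteDupes = true;            // from config

        SignatureSketchProcessor(UpdateRequestProcessor next) { super(next); }

        @Override
        public void processAdd(AddUpdateCommand cmd) throws IOException {
          SolrInputDocument doc = cmd.getSolrInputDocument();

          Signature sig = new Lookup3Signature(); // configurable signatureClass
          for (String field : sigFields) {
            SolrInputField f = doc.getField(field);
            if (f != null) sig.add(String.valueOf(f.getValue()));
          }
          String sigString = toHex(sig.getSignature());
          doc.addField(signatureField, sigString);

          if (overwriteDupes) {
            // The lost info: this Term drives the duplicate deletion in the
            // local update handler, but it is never serialized when the distrib
            // processor forwards the command as a plain document.
            cmd.updateTerm = new Term(signatureField, sigString);
          }
          super.processAdd(cmd);
        }

        private static String toHex(byte[] bytes) {
          StringBuilder sb = new StringBuilder(bytes.length << 1);
          for (byte b : bytes) sb.append(String.format("%02x", b));
          return sb.toString();
        }
      }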

      1. SOLR-3473-trunk-2.patch
        9 kB
        Markus Jelsma
      2. SOLR-3473.patch
        9 kB
        Hoss Man
      3. SOLR-3473.patch
        3 kB
        Hoss Man


          Activity

          Steve Rowe added a comment -

          Bulk move 4.4 issues to 4.5 and 5.0

          Ken Ip added a comment -

          Hoss, any interest in bringing this patch forward to trunk?

          Robert Muir added a comment -

          Unassigned issues -> 4.1

          Lance Norskog added a comment -

          It would be great to have this work in some form, even if it does not have the same API as before.

          Robert Muir added a comment -

          rmuir20120906-bulk-40-change

          Markus Jelsma added a comment -

          Hello - Could the deleteByQuery issue you mention be fixed with SOLR-3473? I've attached an updated patch for today's trunk. The previous patch was missing the signature field, but I added it to one schema. Now other tests seem to fail because they don't see the sig field but do use the update chain.

          Anyway, it seems the BasicDistributedZkTest passes, but I'm not entirely sure: there's too much log output. At least it doesn't fail.

          Hoss Man added a comment -

          Bulk fixing the version info for 4.0-ALPHA and 4.0; all affected issues have "hoss20120711-bulk-40-change" in a comment.

          Hoss Man added a comment -

          Updated patch to include my (meager) attempt at fixing the problem by making processAdd immediately execute a deleteByQuery if the add includes an updateTerm.

          I banged my head against a bunch of version mismatch errors to get the patch into its current state, in which all the updates succeed but the query assertions in the test still fail, indicating that docs with duplicate signatures are making it into the index.

          On the up side: far fewer duplicates are making it into the index now than before the patch (when docs would only be deleted from the node that got the initial request, and then only if that node happened to be a shard leader)...

          wrong number of deduped docs (added 68 total) expected:<7> but was:<10>

          wrong number of deduped docs (added 71 total) expected:<7> but was:<8>

          wrong number of deduped docs (added 70 total) expected:<7> but was:<9>

          ...so apparently there is still some tiny corner-case code path where dups are sneaking in (either that, or the existing deleteByQuery code isn't reliable).

          I'm fairly certain I'm out of my depth at this point.

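          A hypothetical sketch of the approach described above (this is not the attached patch): inside the distributed processor, a pending updateTerm becomes an explicit deleteByQuery, since a DBQ fans out to every shard while the updateTerm does not.

          import java.io.IOException;
          import org.apache.lucene.index.Term;
          import org.apache.solr.update.AddUpdateCommand;
          import org.apache.solr.update.DeleteUpdateCommand;
          import org.apache.solr.update.processor.UpdateRequestProcessor;

          // Hypothetical sketch, not the attached patch: turn the dedup deletion
          // into a distributed deleteByQuery before the add proceeds.
          class DeleteDupesSketch extends UpdateRequestProcessor {
            DeleteDupesSketch(UpdateRequestProcessor next) { super(next); }

            @Override
            public void processAdd(AddUpdateCommand cmd) throws IOException {
              Term dupeTerm = cmd.updateTerm;
              if (dupeTerm != null) {
                DeleteUpdateCommand del = new DeleteUpdateCommand(cmd.getReq());
                del.setQuery(dupeTerm.field() + ":\"" + dupeTerm.text() + "\"");
                super.processDelete(del); // DBQ is forwarded to all shards/replicas
                cmd.updateTerm = null;    // handled; avoid a second, local-only delete
              }
              super.processAdd(cmd);
            }
          }

          Note the delete and the add remain two separately versioned operations, which is presumably where the version mismatch errors and the remaining corner-case duplicates come from.
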
          Hoss Man added a comment -

          Test demonstrating the problem.

          Where SignatureUpdateProcessorFactory lives in the chain doesn't affect the outcome.

          Hoss Man added a comment -

          I'm not entirely sure I'm understanding the problems. Here's what I think I understand...

          1) If you put dedup prior to distrib, then regardless of how it is configured it currently runs twice, which is bad - this seems like it is solved by SOLR-2822.

          2) If you want to use dedup to generate a sig for the uniqueKey field, then it really has to come before distrib, otherwise forwarding to the leader just won't work. (Again: SOLR-2822 should make this do-able.)

          3) If you want to use dedup to generate a sig field that is not the uniqueKey field, AND you want to use "overwriteDupes=true", then (currently) this needs to happen after distrib, because otherwise the info about the deletion - tracked in AddUpdateCommand.updateTerm - is lost when distrib does the forward. This seems like something the distrib processor should deal with by ensuring it serializes/deserializes all of the key information in the AddUpdateCommand when sending/receiving a TOLEADER/FROMLEADER request (using SOLR-2822 vernacular); see the sketch after this comment.

          3a) It's not enough to ensure that the "updateTerm" is forwarded to all the replicas in the shard, because other docs in other shards may have the same term value for the hash. (Hence Markus's suggestion about doing a deleteByQuery - this should happen in distribUP when AddUpdateCommand.updateTerm is non-null.)

          4) Something about document cloning... I still don't really understand this - not just in terms of dedup, but in general I don't really understand why SOLR-3215 is an issue assuming we fix SOLR-2822.

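          A minimal illustration of the serialization idea in point 3, assuming the term were simply carried as a pair of request parameters on the TOLEADER/FROMLEADER forward. The helper class and parameter names are invented for this sketch; no such params exist in Solr:

          import org.apache.lucene.index.Term;
          import org.apache.solr.common.params.ModifiableSolrParams;
          import org.apache.solr.common.params.SolrParams;

          // Hypothetical helper: round-trip AddUpdateCommand.updateTerm through
          // request params on a forwarded update. Param names are made up here.
          final class UpdateTermParams {
            private static final String FIELD = "dedup.term.field";
            private static final String TEXT  = "dedup.term.text";

            // Called on the forwarding side, before serializing the request.
            static void write(Term updateTerm, ModifiableSolrParams params) {
              if (updateTerm != null) {
                params.set(FIELD, updateTerm.field());
                params.set(TEXT, updateTerm.text());
              }
            }

            // Called on the receiving side, to restore cmd.updateTerm.
            static Term read(SolrParams params) {
              String field = params.get(FIELD);
              return (field == null) ? null : new Term(field, params.get(TEXT));
            }
          }

          Per point 3a, restoring the term alone is still not enough: the leader would additionally have to translate it into a deleteByQuery so the deletion reaches the other shards.
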
          Mark Miller added a comment -

          > To work around the problem of having the digest field as ID, could it not simply issue a deleteByQuery for the digest prior to adding it? Would that cause significant overhead for very large systems with many updates?

          Yeah, that might be an option - I don't know that it will be great perf-wise, or race-airtight-wise, but it may be a viable option.

          > We would, from Nutch's point of view, certainly want to avoid changing the ID from URL to digest.

          Ah, interesting. If you are enforcing uniqueness by digest though, is this really a problem? It would only have to be in the Solr world that the id was the digest - and you could even call it something else and have an id:url field as well. Just thinking out loud.

          Or, perhaps we could make it so you could pick the hash field? Then hash on digest. If you are using overwrite=true, this should work, right?

          Or perhaps someone else has some ideas...

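          For concreteness, a client-side version of the workaround discussed above, using the 4.x SolrJ API: delete by digest, then add. The core URL and the field names ("id", "digest") are assumptions for illustration, and as noted the pair of operations is not race-proof:

          import org.apache.solr.client.solrj.SolrServer;
          import org.apache.solr.client.solrj.impl.HttpSolrServer;
          import org.apache.solr.common.SolrInputDocument;

          // Client-side sketch: deleteByQuery on the digest before adding the doc,
          // instead of relying on overwriteDupes in the update chain.
          public class DedupeByDigest {
            public static void main(String[] args) throws Exception {
              SolrServer solr = new HttpSolrServer("http://localhost:8983/solr/collection1");

              SolrInputDocument doc = new SolrInputDocument();
              doc.addField("id", "http://example.com/page");              // URL as uniqueKey
              doc.addField("digest", "3f786850e387550fdab836ed7e6dc881"); // content hash

              // Not race-airtight: a concurrent add with the same digest can slip
              // in between the delete and the add.
              solr.deleteByQuery("digest:" + doc.getFieldValue("digest"));
              solr.add(doc);
              solr.commit();
            }
          }
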
          Markus Jelsma added a comment -

          That makes sense indeed.

          To work around the problem of having the digest field as ID, could it not simply issue a deleteByQuery for the digest prior to adding it? Would that cause significant overhead for very large systems with many updates?

          We would, from Nutch's point of view, certainly want to avoid changing the ID from URL to digest.

          Mark Miller added a comment -

          My next response on the ML:

          I take that back - I think that may be the only way to make this work well. We need that document clone, which will let you put the dedupe proc after the distrib proc. I think in general, the dedupe proc will only work if your signature field is the id field, though - otherwise, hash sharding that happens on the id field is going to cause a problem.

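          A minimal sketch of the "document clone" idea, assuming the forwarding path copied the whole command rather than just the document. SolrInputDocument.deepCopy() is real 4.x API; the helper itself is hypothetical:

          import org.apache.solr.common.SolrInputDocument;
          import org.apache.solr.request.SolrQueryRequest;
          import org.apache.solr.update.AddUpdateCommand;

          // Hypothetical helper: give the local chain its own deep copy of the doc
          // so a proc placed after distrib can mutate it safely, while preserving
          // the command-level dedup state (updateTerm) that a plain document
          // forward drops.
          final class CommandCloner {
            static AddUpdateCommand cloneCommand(AddUpdateCommand cmd, SolrQueryRequest req) {
              AddUpdateCommand copy = new AddUpdateCommand(req);
              copy.solrDoc = cmd.getSolrInputDocument().deepCopy();
              copy.updateTerm = cmd.updateTerm;   // otherwise lost on forwarding
              copy.overwrite = cmd.overwrite;
              copy.commitWithin = cmd.commitWithin;
              return copy;
            }
          }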

            People

            • Assignee: Unassigned
            • Reporter: Markus Jelsma
            • Votes: 3
            • Watchers: 9
