Solr / SOLR-3473

Distributed deduplication broken when using non-uniqueKey for signatureField


    Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 4.0-ALPHA
    • Fix Version/s: 4.9, 6.0
    • Component/s: SolrCloud, update
    • Labels:
      None

      Description

      The current state of things (as of 8.8) is that SignatureUpdateProcessorFactory CAN be safely used in SolrCloud for two possible use cases:

      • For de-duplication:
        • the signatureField MUST be the uniqueKey field AND the processor MUST be configured to run prior to DistributedUpdateProcessor
      • Solely for generating signatures, w/o de-duplication
        • overwriteDupes MUST be set to false ... any signatureField may be used, and it may run at any point in the processor chain
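      The supported de-duplication setup can be illustrated with a small simulation. This is only a sketch under stated assumptions, not Solr code: `signature`, `shard_for`, and `index` are hypothetical helpers, the MD5-based router is a toy stand-in for Solr's real document routing, and each shard is modeled as a plain dictionary keyed by the uniqueKey. It shows why computing the signature before distribution and storing it in the uniqueKey field lets duplicates collapse: identical content always yields the same key, so it is always routed to the same shard and overwritten there.

```python
import hashlib

def signature(doc, fields):
    """Hex digest over the selected field values (toy stand-in for a signature class)."""
    h = hashlib.md5()
    for name in fields:
        h.update(str(doc.get(name, "")).encode("utf-8"))
    return h.hexdigest()

def shard_for(unique_key, num_shards):
    """Toy router (an assumption, not Solr's hash): the uniqueKey value picks the shard."""
    digest = hashlib.md5(unique_key.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % num_shards

def index(docs, fields, num_shards):
    """Compute the signature first, use it as the uniqueKey, then route and overwrite."""
    shards = [dict() for _ in range(num_shards)]
    for doc in docs:
        doc = dict(doc)
        doc["id"] = signature(doc, fields)  # signatureField == uniqueKey
        shards[shard_for(doc["id"], num_shards)][doc["id"]] = doc  # overwrite by uniqueKey
    return shards

docs = [
    {"title": "a", "body": "x"},
    {"title": "a", "body": "x"},  # duplicate content
    {"title": "b", "body": "y"},
]
shards = index(docs, ["title", "body"], num_shards=3)
total = sum(len(s) for s in shards)  # duplicates collapse regardless of shard count
```

      Because the duplicate receives the same key as the first document, it replaces that document on whichever shard the key maps to, so no cross-shard delete is needed.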

      If you attempt to use SignatureUpdateProcessorFactory for de-duplication w/ a non-uniqueKey signature field, one of two failure situations is likely to arise:

      • in a multi-shard collection, documents with identical signatureField values will not be removed from any shard (leader) other than the one the document is routed to (by its id)
      • even in a single-shard collection, with multiple replicas, documents with identical signatureField values will only be deleted on the 'leader' and not on any other replicas, because the leader does not propagate the AddUpdateCommand.updateTerm computed by the SignatureUpdateProcessorFactory to each of its replicas
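
      Both failure modes can be reproduced in a toy model. The sketch below is not Solr code and rests on several assumptions: `Core` is a hypothetical stand-in for one replica's index, `route` is a deliberately simple byte-sum router rather than Solr's real hash, and the update term is modeled as a `(field, value)` tuple that says which existing documents an add should replace.

```python
def route(doc_id, num_shards):
    """Toy router (NOT Solr's hash): byte sum of the id modulo the shard count."""
    return sum(doc_id.encode("utf-8")) % num_shards

class Core:
    """Hypothetical stand-in for one replica's index."""
    def __init__(self):
        self.docs = []

    def add(self, doc, update_term=None):
        # With no update term, an add only replaces documents with the same id.
        field, value = update_term or ("id", doc["id"])
        self.docs = [d for d in self.docs if d.get(field) != value]
        self.docs.append(doc)

# Two documents with different ids but the same signature value.
a = {"id": "doc-1", "sig": "abc"}
b = {"id": "doc-2", "sig": "abc"}

# Failure 1: two shards. Each leader de-duplicates only its own index,
# and the two ids route to different shards, so both copies survive.
leaders = [Core(), Core()]
for doc in (a, b):
    leaders[route(doc["id"], 2)].add(doc, update_term=("sig", doc["sig"]))
multi_shard_total = sum(len(core.docs) for core in leaders)

# Failure 2: one shard, leader plus replica. The leader applies the update
# term, but only the document is forwarded, so the replica overwrites by id.
leader, replica = Core(), Core()
for doc in (a, b):
    leader.add(doc, update_term=("sig", doc["sig"]))
    replica.add(doc)  # update term lost in forwarding
```

      In this model the leader ends up with one document while the replica keeps both, and the two-shard collection keeps one copy per shard.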

      Original bug report

      Solr's deduplication via the SignatureUpdateProcessor is broken for distributed updates on SolrCloud.

      Mark Miller:

      Looking again at the SignatureUpdateProcessor code, I think that indeed this won't currently work with distrib updates. Could you file a JIRA issue for that? The problem is that we convert update commands into solr documents - and that can cause a loss of info if an update proc modifies the update command.

      I think the reason that you see a multiple values error when you try the other order is because of the lack of a document clone (the other issue I mentioned a few emails back). Addressing that won't solve your issue though - we have to come up with a way to propagate the currently lost info on the update command.

      Please see the ML thread for the full discussion: http://lucene.472066.n3.nabble.com/SolrCloud-deduplication-td3984657.html

        Attachments

        1. SOLR-3473.patch
          3 kB
          Chris M. Hostetter
        2. SOLR-3473.patch
          9 kB
          Chris M. Hostetter
        3. SOLR-3473-trunk-2.patch
          9 kB
          Markus Jelsma

            People

            • Assignee: Unassigned
            • Reporter: Markus Jelsma (markus17)
