SOLR-9918

An UpdateRequestProcessor to skip duplicate inserts and ignore updates to missing docs

    Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 6.4, 7.0
    • Component/s: update
    • Security Level: Public (Default Security Level. Issues are Public)
    • Labels:
      None
    • Flags:
      Patch

      Description

      This is an UpdateRequestProcessor and Factory that we have been using in production, to handle 2 common cases that were awkward to achieve using the existing update pipeline and current processor classes:

      • When inserting document(s), if some already exist then quietly skip the new document inserts - do not churn the index by replacing the existing documents and do not throw a noisy exception that breaks the batch of inserts. By analogy with SQL, insert if not exists. In our use-case, multiple application instances can (rarely) process the same input so it's easier for us to de-dupe these at Solr insert time than to funnel them into a global ordered queue first.
      • When applying AtomicUpdate documents, if a document being updated does not exist, quietly do nothing - do not create a new partially-populated document and do not throw a noisy exception about missing required fields. By analogy with SQL, update where id = ... Our use-case relies on this because we apply updates optimistically and have best-effort knowledge about what documents will exist, so it's easiest to skip the updates (in the same way a Database would).
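
      The two rules above can be sketched with a toy in-memory model. This is only an illustration of the intended semantics, not the code in the attached patch: the class `SkipExistingSketch`, its `insert` and `atomicUpdate` methods, and the map standing in for the index are all hypothetical names introduced here.

```java
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;

/** Toy in-memory model of the two skip rules; NOT the Solr implementation. */
final class SkipExistingSketch {
    // Hypothetical stand-in for the index: document id -> field map.
    private final Map<String, Map<String, Object>> index = new LinkedHashMap<>();
    private final boolean skipInsertIfExists;
    private final boolean skipUpdateIfMissing;

    SkipExistingSketch(boolean skipInsertIfExists, boolean skipUpdateIfMissing) {
        this.skipInsertIfExists = skipInsertIfExists;
        this.skipUpdateIfMissing = skipUpdateIfMissing;
    }

    /** Full-document insert. Returns true if the index was written. */
    boolean insert(String id, Map<String, Object> fields) {
        if (skipInsertIfExists && index.containsKey(id)) {
            return false; // quietly skip: no overwrite, no exception
        }
        index.put(id, new HashMap<>(fields));
        return true;
    }

    /** Atomic-style update of selected fields. Returns true if the index was written. */
    boolean atomicUpdate(String id, Map<String, Object> changes) {
        Map<String, Object> existing = index.get(id);
        if (existing == null) {
            if (skipUpdateIfMissing) {
                return false; // quietly skip: no partial document is created
            }
            // Without the skip rule, a partially-populated document appears.
            index.put(id, new HashMap<>(changes));
            return true;
        }
        existing.putAll(changes);
        return true;
    }

    Map<String, Object> get(String id) {
        return index.get(id);
    }

    public static void main(String[] args) {
        SkipExistingSketch model = new SkipExistingSketch(true, true);
        System.out.println(model.insert("doc1", Map.of("title", "first")));     // true
        System.out.println(model.insert("doc1", Map.of("title", "second")));    // false (skipped)
        System.out.println(model.atomicUpdate("doc2", Map.of("title", "x")));   // false (skipped)
        System.out.println(model.get("doc1"));                                  // {title=first}
    }
}
```

      With both flags set to false the same model overwrites on insert and creates a partial document on update, which is the behaviour the description is trying to avoid.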

      I would have kept this in our own package hierarchy but it relies on some package-scoped methods, and seems like it could be useful to others if they choose to configure it. Some bits of the code were borrowed from DocBasedVersionConstraintsProcessorFactory.

      Attached patch has unit tests to confirm the behaviour.

      This class can be used by configuring solrconfig.xml like so:

        <updateRequestProcessorChain name="skipexisting">
          <processor class="solr.LogUpdateProcessorFactory" />
          <processor class="org.apache.solr.update.processor.SkipExistingDocumentsProcessorFactory">
            <bool name="skipInsertIfExists">true</bool>
            <bool name="skipUpdateIfMissing">false</bool> <!-- We will override this per-request -->
          </processor>
          <processor class="solr.DistributedUpdateProcessorFactory" />
          <processor class="solr.RunUpdateProcessorFactory" />
        </updateRequestProcessorChain>
      

      and an initParams default of:

            <str name="update.chain">skipexisting</str>
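
      The config comment above mentions overriding `skipUpdateIfMissing` per request. A minimal sketch of building such an update URL follows, assuming the processor accepts request parameters with the same names as the config options (the description implies this but does not spell it out). The host, port, collection name `mycollection` and the helper `updateUri` are hypothetical.

```java
import java.net.URI;
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;

/** Builds an update URL that selects the chain and overrides one flag per request. */
final class UpdateUrlSketch {
    // Hypothetical helper: joins URL-encoded request parameters onto the /update path.
    static URI updateUri(String baseUrl, String collection, Map<String, String> params) {
        StringJoiner query = new StringJoiner("&");
        for (Map.Entry<String, String> e : params.entrySet()) {
            query.add(URLEncoder.encode(e.getKey(), StandardCharsets.UTF_8)
                    + "=" + URLEncoder.encode(e.getValue(), StandardCharsets.UTF_8));
        }
        return URI.create(baseUrl + "/" + collection + "/update?" + query);
    }

    public static void main(String[] args) {
        Map<String, String> params = new LinkedHashMap<>();
        params.put("update.chain", "skipexisting");   // chain name from the config above
        params.put("skipUpdateIfMissing", "true");    // assumed per-request override name
        // Host and collection are placeholders for illustration only.
        System.out.println(updateUri("http://localhost:8983/solr", "mycollection", params));
    }
}
```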
      
      1. SOLR-9918.patch
        27 kB
        Tim Owen
      2. SOLR-9918.patch
        26 kB
        Tim Owen

        Activity

        koji Koji Sekiguchi added a comment -

        I believe the proposal is very useful for users who need this function, but it would help users to have an additional explanation of how it differs from the existing processor that provides a similar function.

        How do users decide which UpdateRequestProcessor to use for their use cases as compared to SignatureUpdateProcessor?

        dsmiley David Smiley added a comment -

        Cool

        TimOwen Tim Owen added a comment -

        Fair points Koji - I have updated the patch with a bit more documentation. I've also added the example configuration in the Javadoc comment.

        The Confluence page is probably the best place to put that kind of guidance on which processors to choose for different situations.

        In the particular case of the SignatureUpdateProcessor, that class will cause the new document to overwrite/replace any existing document, not skip it, which is why I didn't use it for our use-case.

        koji Koji Sekiguchi added a comment -

        Thank you for your additional explanation. I agree with you that the Confluence page is the best place to put that kind of guidance. I just wanted to see such information in the ticket, not only in the Javadoc, because I think it helps committers to understand the requirement and importance of this proposal.

        As for SignatureUpdateProcessor, I thought it skipped adding the doc if the signature is the same, but when I looked into the patch on SOLR-799, I noticed that it always updates the existing document even if the doc has the same signature.

        TimOwen Tim Owen added a comment -

        OK, I see what you mean. I can explain our use-case if that helps to show why we developed this processor, and when it might prove useful.

        We have a Kafka queue of messages, which are a mixture of Create, Update and Delete operations, and these are consumed and fed into two different storage systems - Solr and an RDBMS. We want the behaviour to be consistent, so that the two systems are in sync, and the way the Database storage app works is that Create operations are implemented as effectively INSERT IF NOT EXISTS ... and Update operations are the typical SQL UPDATE .. WHERE id = .. that quietly do nothing if there is no row for id. So we want the Solr storage to behave in the same way.

        There can occasionally be duplicate messages that Create the same id due to the hundreds of instances of the app that adds messages to Kafka, and small race conditions that mean two or more of them will do some duplicate work. We chose to accept this situation and de-dupe downstream by having both storage apps behave as above.

        Another scenario is that, since we have the Kafka queue as a buffer, if there are any problems downstream we can always stop the storage apps, restore last night's backup, rewind the Kafka consumer offset (slightly beyond the backup point) and then replay. In this situation we don't want a lot of index churn for the overlapping Create messages.

        With updates, the apps which add Update messages only have best-effort knowledge of which document/row ids are relevant to the field/column being changed by the update message. So we quite commonly have messages that are optimistic updates, for a document that doesn't in fact exist (now). The database storage handles this quietly, so we wanted the same behaviour in Solr. Initially what happened in Solr was we'd get newly-created documents containing only the fields changed in the AtomicUpdate, so we added a required field to avoid that happening, which works but is noisy as we get a Solr exception each time (and then batch updates are messy because we have to split and retry).
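
        The difference between the required-field workaround and the skip behaviour in a batch can be sketched with a toy model. Everything here is hypothetical and only models the behaviour described above: `BatchUpdateSketch`, `Update`, `applyStrict` and `applySkipping` are illustrative names, and the exception merely stands in for the missing-required-field error from Solr.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Toy comparison of batch updates: fail on a missing doc vs. quietly skip it. */
final class BatchUpdateSketch {
    /** One field change addressed to a document id (hypothetical type). */
    static final class Update {
        final String id;
        final String field;
        final Object value;

        Update(String id, String field, Object value) {
            this.id = id;
            this.field = field;
            this.value = value;
        }
    }

    /** Models the required-field workaround: a missing target aborts the batch. */
    static void applyStrict(Map<String, Map<String, Object>> index, List<Update> batch) {
        for (Update u : batch) {
            Map<String, Object> doc = index.get(u.id);
            if (doc == null) {
                // Stand-in for the "missing required field" exception.
                throw new IllegalStateException("cannot create partial doc " + u.id);
            }
            doc.put(u.field, u.value);
        }
    }

    /** Models the skip rule: missing targets are ignored. Returns how many were skipped. */
    static int applySkipping(Map<String, Map<String, Object>> index, List<Update> batch) {
        int skipped = 0;
        for (Update u : batch) {
            Map<String, Object> doc = index.get(u.id);
            if (doc == null) {
                skipped++;
                continue;
            }
            doc.put(u.field, u.value);
        }
        return skipped;
    }

    public static void main(String[] args) {
        Map<String, Map<String, Object>> index = new HashMap<>();
        index.put("a", new HashMap<>());
        index.put("c", new HashMap<>());
        List<Update> batch = List.of(
                new Update("a", "status", "done"),
                new Update("b", "status", "done"),   // "b" does not exist
                new Update("c", "status", "done"));
        System.out.println("skipped = " + applySkipping(index, batch)); // skipped = 1
        System.out.println(index.get("c"));                             // {status=done}
    }
}
```

        In the strict variant the caller has to catch the failure, drop the offending update and resubmit the rest, which is the split-and-retry mess mentioned above.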

        I looked at DocBasedVersionConstraintsProcessor but we don't have explicitly-managed versioning for our documents in Solr. Then I looked at SignatureUpdateProcessor but that does churn the index and overwrites documents, which we didn't want. I also considered TolerantUpdateProcessor, but that doesn't really solve the issue for inserts; it would just make some update batches less noisy.

        I'd say this processor is useful in situations where you have documents that don't have any concept of multiple versions that can be assigned by the app, and don't have any kind of fuzziness about similar documents, i.e. each document has a strong identity, akin to a Database unique key.

        koji Koji Sekiguchi added a comment -

        Thank you for the great explanation, which is more than I expected.

        koji Koji Sekiguchi added a comment -

        I think this is ready.

        jira-bot ASF subversion and git services added a comment -

        Commit d66bfba5dc1bd9154bd48898865f51d9715e8d0c in lucene-solr's branch refs/heads/master from koji
        [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=d66bfba ]

        SOLR-9918: Add SkipExistingDocumentsProcessor that skips duplicate inserts and ignores updates to missing docs

        jira-bot ASF subversion and git services added a comment -

        Commit 2979a1eacd916201548303245f81705da7f9cc36 in lucene-solr's branch refs/heads/branch_6x from koji
        [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=2979a1e ]

        SOLR-9918: Add SkipExistingDocumentsProcessor that skips duplicate inserts and ignores updates to missing docs

        koji Koji Sekiguchi added a comment -

        Thanks, Tim!

        jira-bot ASF subversion and git services added a comment -

        Commit 2437204730130dc8c03efb111ec7d4db456189ed in lucene-solr's branch refs/heads/master from Shalin Shekhar Mangar
        [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=2437204 ]

        SOLR-9918: Remove unused import to make precommit happy

        jira-bot ASF subversion and git services added a comment -

        Commit 2f721048d4e9e35ba81ad574d3927cdba798ee24 in lucene-solr's branch refs/heads/branch_6x from Shalin Shekhar Mangar
        [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=2f72104 ]

        SOLR-9918: Remove unused import to make precommit happy

        (cherry picked from commit 2437204)


          People

          • Assignee:
            koji Koji Sekiguchi
            Reporter:
            TimOwen Tim Owen
          • Votes:
            1
            Watchers:
            5
