This is an UpdateRequestProcessor and Factory that we have been using in production to handle two common cases that were awkward to achieve with the existing update pipeline and current processor classes:
- When inserting documents, if some already exist, quietly skip those inserts: do not churn the index by replacing the existing documents, and do not throw a noisy exception that breaks the batch of inserts. By analogy with SQL, this is "insert if not exists". In our use case, multiple application instances can (rarely) process the same input, so it is easier for us to de-dupe at Solr insert time than to funnel everything into a global ordered queue first.
- When applying AtomicUpdate documents, if a document being updated does not exist, quietly do nothing: do not create a new partially populated document, and do not throw a noisy exception about missing required fields. By analogy with SQL, this is "update where id = ...". Our use case relies on this because we apply updates optimistically with only best-effort knowledge of which documents exist, so it is easiest to skip such updates (in the same way a database would).
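The two rules above can be sketched as a small decision function. This is only an illustration and not the code in the patch: the class SkipExistingSketch, its in-memory set standing in for the index, and the add/shouldSkip methods are all hypothetical names. Only the two flag names, skipInsertIfExists and skipUpdateIfMissing, come from the configuration shown in this description.

```java
import java.util.HashSet;
import java.util.Set;

/** Hypothetical illustration of the skip rules; not the actual Solr processor. */
final class SkipExistingSketch {
    // Stand-in for "ids already present in the index" (assumption for this sketch).
    private final Set<String> index = new HashSet<>();
    private final boolean skipInsertIfExists;
    private final boolean skipUpdateIfMissing;

    SkipExistingSketch(boolean skipInsertIfExists, boolean skipUpdateIfMissing) {
        this.skipInsertIfExists = skipInsertIfExists;
        this.skipUpdateIfMissing = skipUpdateIfMissing;
    }

    /** Returns true when the document should be dropped quietly. */
    static boolean shouldSkip(boolean isAtomicUpdate, boolean exists,
                              boolean skipInsertIfExists, boolean skipUpdateIfMissing) {
        if (isAtomicUpdate) {
            // "update where id = ...": nothing to update, so do nothing.
            return skipUpdateIfMissing && !exists;
        }
        // "insert if not exists": leave the existing document untouched.
        return skipInsertIfExists && exists;
    }

    /** Returns true if the document was passed on, false if it was skipped. */
    boolean add(String id, boolean isAtomicUpdate) {
        boolean exists = index.contains(id);
        if (shouldSkip(isAtomicUpdate, exists, skipInsertIfExists, skipUpdateIfMissing)) {
            return false;
        }
        // A real chain would hand the document to the next processor here.
        index.add(id);
        return true;
    }

    public static void main(String[] args) {
        SkipExistingSketch chain = new SkipExistingSketch(true, true);
        System.out.println(chain.add("doc1", false)); // first insert of doc1
        System.out.println(chain.add("doc1", false)); // duplicate insert of doc1
        System.out.println(chain.add("doc2", true));  // atomic update of a missing doc
    }
}
```

With both flags enabled, the duplicate insert and the update of a missing document are dropped without an exception, so the rest of the batch proceeds.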
I would have kept this in our own package hierarchy but it relies on some package-scoped methods, and seems like it could be useful to others if they choose to configure it. Some bits of the code were borrowed from DocBasedVersionConstraintsProcessorFactory.
The attached patch includes unit tests that confirm the behaviour.
This class can be used by configuring solrconfig.xml like so:
<updateRequestProcessorChain name="skipexisting">
  <processor class="solr.LogUpdateProcessorFactory" />
  <processor class="org.apache.solr.update.processor.SkipExistingDocumentsProcessorFactory">
    <bool name="skipInsertIfExists">true</bool>
    <bool name="skipUpdateIfMissing">false</bool> <!-- We will override this per-request -->
  </processor>
  <processor class="solr.DistributedUpdateProcessorFactory" />
  <processor class="solr.RunUpdateProcessorFactory" />
</updateRequestProcessorChain>
and initParams defaults of