I think in an ideal world...
- Solr Cloud users should be able to just declare a straightforward chain of processors and not need to worry about which nodes they run on (ie: any chain they used pre-cloud should still work)
- if a user does care which parts of the chain run once vs on every node, it should be simple to indicate that decision by putting some "marker" processor (ie: DistributedUpdateProcessorFactory) in the chain denoting where distribution should happen (see the config sketch below)
- If you are using cloud mode and don't specify where the distributed update logic happens, then Solr should pick for you – either "first" before any other processors, or "last" just before RunUpdateProcessor ... I don't have an opinion on which is better; I'm sure other people who have been experimenting with Solr Cloud for a while can tell me.
(I'm usually the person opposed to doing things magically behind the scenes like this, but "Solr Cloud" is becoming a central enough concept in Solr, with many components doing special things if "zkEnabled", that I think moving forward it's "ok" to treat distributed updating as the norm in cloud mode and optimize for it.)
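For illustration, here's roughly what an explicitly "marked" chain might look like in solrconfig.xml under this proposal (the placement of DistributedUpdateProcessorFactory is the only interesting part; the other processors are just placeholders):

{code:xml}
<updateRequestProcessorChain name="mychain">
  <!-- anything before the marker runs once, on the node that first receives the update -->
  <processor class="solr.LogUpdateProcessorFactory" />
  <!-- the marker: distribution to other nodes happens at this point -->
  <processor class="solr.DistributedUpdateProcessorFactory" />
  <!-- anything after the marker runs on every node the update is forwarded to -->
  <processor class="solr.RunUpdateProcessorFactory" />
</updateRequestProcessorChain>
{code}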
The most straightforward way I can think of to do this would be:
- if Solr is in "cloud mode" then when SolrCore is initializing the chains from solrconfig.xml, any chain that doesn't include DistributedUpdateProcessorFactory should have it added automatically (alternatively: maybe we only add it if RunUpdateProcessor is in the chain, so anyone using chains w/o RunUpdateProcessor – for some weird bizarre purpose we can't imagine – won't be surprised by DistributedUpdateProcessorFactory getting added magically); there's a rough sketch of this after the list
- DistributedUpdateProcessor should add some param when forwarding to any other node indicating that the request is being distributed
- when an update request is received, if it has this special param in it, then any processor in the chain prior to DistributedUpdateProcessorFactory is skipped
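To make the injection part concrete, here's a very rough sketch of what the init-time logic could look like (the helper name is made up, the "just before RunUpdateProcessor" placement is picked arbitrarily from the two options above, and none of this is code from an actual patch; the "skip ahead" half is sketched further down, after the update.distrib idea):

{code:java}
import java.util.ArrayList;
import java.util.List;

import org.apache.solr.update.processor.DistributedUpdateProcessorFactory;
import org.apache.solr.update.processor.RunUpdateProcessorFactory;
import org.apache.solr.update.processor.UpdateRequestProcessorFactory;

public class ChainInjectionSketch {

  /** In cloud mode, inject DistributedUpdateProcessorFactory into any chain
      that doesn't already declare one, just before RunUpdateProcessorFactory. */
  static List<UpdateRequestProcessorFactory> maybeInjectDistrib(
      List<UpdateRequestProcessorFactory> declared, boolean zkEnabled) {

    if (!zkEnabled) {
      return declared; // not cloud mode: leave the chain exactly as declared
    }
    for (UpdateRequestProcessorFactory f : declared) {
      if (f instanceof DistributedUpdateProcessorFactory) {
        return declared; // the user already marked the chain explicitly
      }
    }
    List<UpdateRequestProcessorFactory> result =
        new ArrayList<UpdateRequestProcessorFactory>(declared);
    for (int i = 0; i < result.size(); i++) {
      if (result.get(i) instanceof RunUpdateProcessorFactory) {
        result.add(i, new DistributedUpdateProcessorFactory());
        return result;
      }
    }
    return result; // no RunUpdateProcessor in the chain: leave it alone
  }
}
{code}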
This idea is very similar to part of what Jan suggested in his comment above...
The Distrib processor could set some state info on the request to the next node so that chain processing could continue where it left off. E.g. &update.chain.nextproc=<name/id-of-next-proc>. This would require introduction of named processor instances.
...but what I'm thinking of wouldn't require named processors and would be specific to distributed updates (but wouldn't preclude named processors and more enhanced logic down the road if someone wanted it).
I think this would be fairly feasible just by making some small modifications to DistributedUpdateProcessor (to add the new special param when forwarding) and UpdateRequestProcessorChain (to inject the DistributedUpdateProcessorFactory in cloud mode, and to skip up to the DistributedUpdateProcessorFactory if the param is set). I do, however, still think we should generalize somewhat:
- DistributedUpdateProcessorFactory should be made to implement some marker interface with no methods (ie: DistributedUpdateMarker)
- UpdateRequestProcessorChain.init should scan for instances of DistributedUpdateMarker in the chain (instead of looking explicitly for DistributedUpdateProcessorFactory) when deciding whether to inject a new DistributedUpdateProcessorFactory into the chain
- UpdateRequestProcessorChain.createProcessor should scan for instances of DistributedUpdateMarker in the chain (instead of looking explicitly for DistributedUpdateProcessorFactory) when "skipping ahead" if the special param is found in the request
...that way advanced users can write their own distributed update processor implementing that interface and register it explicitly in their chain if they are so inclined, or implement a NoOp update processor implementing that interface if they want to bypass the magic completely.
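A sketch of how little that generalization would involve (the interface name is the one suggested above; the NoOp factory is just an example name, not an existing class):

{code:java}
import org.apache.solr.request.SolrQueryRequest;
import org.apache.solr.response.SolrQueryResponse;
import org.apache.solr.update.processor.UpdateRequestProcessor;
import org.apache.solr.update.processor.UpdateRequestProcessorFactory;

/** Marker interface with no methods: "distribution happens at this point in the chain". */
interface DistributedUpdateMarker {
}

/** The "bypass the magic completely" option: implements the marker so nothing
    gets injected into the chain, but never distributes anything. */
class NoOpDistributedUpdateProcessorFactory
    extends UpdateRequestProcessorFactory implements DistributedUpdateMarker {

  @Override
  public UpdateRequestProcessor getInstance(SolrQueryRequest req,
                                            SolrQueryResponse rsp,
                                            UpdateRequestProcessor next) {
    return next; // pass updates straight through to the rest of the chain
  }
}
{code}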
As a possible optimization/simplification to what gets sent over the wire, the new param that UpdateRequestProcessorChain would start looking for in order to "skip ahead" in the chain could replace the existing "leader" boolean param DistributedUpdateProcessor currently uses (aka: SEEN_LEADER) with an enum-style param (perhaps called "update.distrib"?)...
- none - default if unset, means no distribution has happened
- toleader - means the request is being sent to the leader
- fromleader - means the leader is sending the request to all nodes
UpdateRequestProcessorChain would only care if the value is not "none", in which case it would skip ahead to the DistributedUpdateMarker in the chain. DistributedUpdateProcessor would care if the value is "toleader" or "fromleader", in which case its logic would toggle the same way it currently does for SEEN_LEADER.
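Roughly something like this (I'm calling the enum DistribPhase purely so the sketch has a name; only the param name and values come from the proposal above):

{code:java}
import java.util.Locale;

import org.apache.solr.common.params.SolrParams;

public enum DistribPhase {
  NONE,       // default if the param is unset: no distribution has happened
  TOLEADER,   // the request is being sent to the leader
  FROMLEADER; // the leader is sending the request to the other nodes

  /** Parse the proposed "update.distrib" request param, defaulting to NONE.
      The chain skips ahead whenever this returns anything other than NONE;
      the distributed processor branches on TOLEADER vs FROMLEADER where it
      currently branches on SEEN_LEADER. */
  public static DistribPhase parse(SolrParams params) {
    String value = params.get("update.distrib");
    return value == null ? NONE : valueOf(value.toUpperCase(Locale.ROOT));
  }
}
{code}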