Solr / SOLR-3215

We should clone the SolrInputDocument before adding locally and then send that clone to replicas.

    Details

    • Type: Improvement
    • Status: Closed
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 4.0-BETA, Trunk
    • Component/s: None
    • Labels: None

      Description

      If we don't do this, the behavior is a little unexpected. You cannot avoid having other processors always hit documents twice unless we support using multiple update chains. We have another issue open that should make this better, but I'd like to do this sooner than that. We are going to have to end up cloning anyway when we want to offer the ability to not wait for the local add before sending to replicas.

      Cloning with the current SolrInputDocument, SolrInputField apis is a little scary - there is an Object to contend with - but it seems we can pretty much count on that being a primitive that we don't have to clone?
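
      For illustration only, a minimal sketch of the kind of copy being proposed (the helper class below is invented for this example, not existing API; it shares the field value objects themselves, which is safe only under the assumption above that values are primitives we don't have to clone):

      {code:java}
      import org.apache.solr.common.SolrInputDocument;
      import org.apache.solr.common.SolrInputField;

      public class DocCopier {
        // Hypothetical helper: copies the document structure so that
        // processors running after the distrib stage cannot mutate what
        // gets sent to replicas. Values are shared, assumed immutable.
        public static SolrInputDocument copy(SolrInputDocument doc) {
          SolrInputDocument out = new SolrInputDocument();
          for (SolrInputField field : doc) {
            out.addField(field.getName(), field.getValue());
          }
          return out;
        }
      }
      {code}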


          Activity

          Lance Norskog added a comment - edited

          Can I suggest the concept of a "preprocessed" or "flattened" or "read-only" document?

          Each processor would have the responsibility to say "I only change un-flattened documents". After all, update processors can also do things like logging - it would not be OK to skip the logger just because it is above the DistributedProcessor stage.

          If you want, you can add a marker interface for DistributedProcessor that means "I alter documents". The update processor chain handler can then skip these processors.
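
          As a purely illustrative sketch of that marker-interface idea (every name below is invented for this example and is not existing Solr API):

          {code:java}
          import java.util.List;
          import java.util.stream.Collectors;

          // Hypothetical marker meaning "I alter documents".
          interface AltersDocuments { }

          // Stand-in for the real update processor factory type.
          interface ProcessorFactory { }

          class ChainBuilder {
            // When a replica replays a forwarded update, drop the
            // processors that already altered the document on the leader.
            static List<ProcessorFactory> forReplica(List<ProcessorFactory> chain) {
              return chain.stream()
                  .filter(f -> !(f instanceof AltersDocuments))
                  .collect(Collectors.toList());
            }
          }
          {code}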

          Mark Miller added a comment -

          Okay, I'm going to put this in. I'll wait a bit to see if Yonik pipes in first.

          Hoss Man added a comment -

          bulk fixing the version info for 4.0-ALPHA and 4.0 - all affected issues have "hoss20120711-bulk-40-change" in a comment

          Mark Miller added a comment -

          nothing stops people from putting other processors between DistributeUpdateProcessorFactory and RunUpdateProcessorFactory

          Yonik has argued that we shouldn't support this, rather than cloning.

          Hoss Man added a comment -

          I think this is still pretty damn important:

          1) nothing stops people from putting other processors between DistributeUpdateProcessorFactory and RunUpdateProcessorFactory
          2) several use cases identified above are simpler (or require) putting processors between DistributeUpdateProcessorFactory and RunUpdateProcessorFactory
          3) any processor put between DistributeUpdateProcessorFactory and RunUpdateProcessorFactory is a time bomb of unpredictability w/o some sort of cloning.

          Mark Miller added a comment -

          We need to reevaluate this in light of the update processor work that hossman did - this may not be needed unless something else prompts it.

          Hoss Man added a comment -

          Are there use cases for this?

          FWIW: the other use case that just occurred to me today is trying to use any update processors that deal with multiple field values (either existing processors people may have written, or any of the new ones added by SOLR-2802, or SOLR-2599) in conjunction with field updating (SOLR-139) which is implemented as part of the DistributedUpdateProcessor.

          ie: if you want to have a multivalued "all_prices" field, and a single valued "lowest_price" field that you populate using a combination of CloneFieldUpdateProcessorFactory + MinFieldValueUpdateProcessorFactory – that will need to happen after the DistributedUpdateProcessor in order to ensure all the values are available if someone does field update to "add" a value to "all_prices".
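
          To make that concrete, a hypothetical solrconfig.xml chain along those lines (the init params match the factories' documented source/dest/fieldName options; the chain name is invented here, and the field names come from the example above):

          {code:xml}
          <updateRequestProcessorChain name="prices">
            <processor class="solr.DistributedUpdateProcessorFactory"/>
            <!-- Runs on every replica, after any atomic "add" to all_prices,
                 so lowest_price is derived from the complete value list. -->
            <processor class="solr.CloneFieldUpdateProcessorFactory">
              <str name="source">all_prices</str>
              <str name="dest">lowest_price</str>
            </processor>
            <processor class="solr.MinFieldValueUpdateProcessorFactory">
              <str name="fieldName">lowest_price</str>
            </processor>
            <processor class="solr.RunUpdateProcessorFactory"/>
          </updateRequestProcessorChain>
          {code}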

          Hoss Man added a comment -

          Cloning with the current SolrInputDocument, SolrInputField apis is a little scary - there is an Object to contend with - but it seems we can pretty much count on that being a primitive that we don't have to clone?

          ... currently we have to offer this (local is special) as well though - it's a requirement for our current 'super safe' mode that we add locally first.

          Strawman suggestion (sketched below): instead of using a simple SolrInputDocument.clone(), with the scariness Miller mentioned about some processor maybe creating a field value that isn't Cloneable, what if instead we:

          1. use the JavaBinCodec to serialize the SolrInputDocument to a byte[]
          2. hang onto that byte[] while doing the local add
          3. then (re)use that byte[] in all of the requests to the remote replicas

          Not sure how easy the reuse of the byte[] would be given the existing SolrJ API, but...

          • Even if some field values aren't Cloneable primitives, they have to be serializable using the codec, or we already have a bigger bug to worry about than the risk of concurrent mods to the object
          • Bonus: we only pay the cost of serializing the SolrInputDocument once on the leader, not N times for N replicas.
          • Only "downside" i can think of is that the leader has to buffer the whole doc in memory as a byte[] instead of streaming it – but if we assume most shards will have more then N replicas, the trade off seems worth it – to me anyway (optimize to use more RAM and save time/cpu in serializing redundently)
          Mark Miller added a comment -

          one way to avoid this would be to stop treating the "local" replica as special

          Just to point out explicitly: currently we have to offer this (local is special) as well though - it's a requirement for our current 'super safe' mode that we add locally first. So unless we also address some other issues, we'd have to allow things to happen both ways.

          Hoss Man added a comment -

          What about the sig update proc or others that do something like modify the updateTerm on the update command - this type of thing does not get distributed as we turn update commands into update requests, and the mapping doesn't cover updateTerm.

          that seems like a completely distinct problem ... as I mentioned in SOLR-3473, the distrib update proc really needs to be aware of all properties of the AddUpdateCommand and serialize/deserialize that info when forwarding to the leader (and all of the replicas) ... in the case of "updateTerm" it's not even enough to make sure the shard leader and all replicas know about it - other docs in other shards might also need to be deleted.

          Mark Miller added a comment -

          Are there use cases for this?

          What about the sig update proc or others that do something like modify the updateTerm on the update command - this type of thing does not get distributed as we turn update commands into update requests, and the mapping doesn't cover updateTerm.

          Hoss Man added a comment -

          DistributedUpdateProcessor should come right before RunUpdateProcessor (or are you assuming we might support random update processors in-between? Are there use cases for this?)

          the main scenario I've seen/heard mentioned is the idea of processors that are computationally cheap, but increase the size of the document significantly (ie: clone a big ass text field and strip the HTML from the clone), so you want it to happen after distrib (on every replica) to minimize the amount of data sent over the wire.


          Intuitively, you expect that processors that run after the distrib processor will not hit the document sent to replicas before the docs are sent to replicas - but it will.

          to clarify (because I kept not understanding what the crux of the issue was here, so if I post this comment I'll remember next time w/o needing to ask Mark on IRC again): if we do NOT clone the doc, there is a race condition where local processors executing after the distrib processor may modify the documents before they are serialized and forwarded to one or more shards.

          one way to avoid this would be to stop treating the "local" replica as special, and instead have distrib forward back to localhost (via HTTP, just like every other replica) and abort the current request.

          Mark Miller added a comment -

          A couple use cases:

          1. You don't want documents to have every processor applied to them twice. Intuitively, you expect that processors that run after the distrib processor will not hit the document sent to replicas before the docs are sent to replicas - but it will. Barring future features (like the JIRA around update chains and SolrCloud), there is currently no way to have an update processor hit a document only once when it goes to a replica.

          2. When we want to offer the option of adding locally and sending to replicas at the same time, we will still need this, and we don't want to get people used to the current odd behavior, because you couldn't even easily duplicate it (not that I think you'd want to) if you wanted to add locally and send to replicas in parallel.

          Yonik Seeley added a comment -

          Not sure I understand the problem this is solving. version is added to the SolrInputDocument and then that can be indexed locally and sent to replicas. DistributedUpdateProcessor should come right before RunUpdateProcessor (or are you assuming we might support random update processors in-between? Are there use cases for this?)

          Mark Miller added a comment -

          I'd like to commit this soon.


            People

            • Assignee: Mark Miller
            • Reporter: Mark Miller
            • Votes: 3
            • Watchers: 3
