Solr
  1. Solr
  2. SOLR-4016

Deduplication is broken by partial update

    Details

    • Type: Bug Bug
    • Status: Closed
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: 4.0
    • Fix Version/s: 4.1, 6.0
    • Component/s: update
    • Labels:
    • Environment:

      Tomcat6 / Catalina on Ubuntu 12.04 LTS

      Description

      The SignatureUpdateProcessorFactory used (primarily?) for deduplication does not consider partial update semantics.

      The below uses the following solrconfig.xml excerpt:

           <updateRequestProcessorChain name="text_hash">
             <processor class="solr.processor.SignatureUpdateProcessorFactory">
               <bool name="enabled">true</bool>
               <str name="signatureField">text_hash</str>
               <bool name="overwriteDupes">false</bool>
               <str name="fields">text</str>
               <str name="signatureClass">solr.processor.TextProfileSignature</str>
             </processor>
             <processor class="solr.LogUpdateProcessorFactory" />
             <processor class="solr.RunUpdateProcessorFactory" />
           </updateRequestProcessorChain>
      

      Firstly, the processor treats

      {"set": "value"}

      as a string and hashes it, instead of the value alone:

      $ curl '$URL/update?commit=true' -H 'Content-type:application/json' -d '{"add":{"doc":{"id": "abcde", "text": {"set": "hello world"}}}}' && curl '$URL/select?q=id:abcde'
      {"responseHeader":{"status":0,"QTime":30}}
      <?xml version="1.0" encoding="UTF-8"?><response><lst name="responseHeader"><int name="status">0</int><int name="QTime">1</int><lst name="params"><str name="q">id:abcde</str></lst></lst><result name="response" numFound="1" start="0"><doc><str name="id">abcde</str><str name="text">hello world</str><str name="text_hash">ad48c7ad60ac22cc</str><long name="_version_">1417247434224959488</long></doc></result>
      </response>
      $
      $ curl '$URL/update?commit=true' -H 'Content-type:application/json' -d '{"add":{"doc":{"id": "abcde", "text": "hello world"}}}' && curl '$URL/select?q=id:abcde'
      {"responseHeader":{"status":0,"QTime":27}}
      <?xml version="1.0" encoding="UTF-8"?>
      <response>
      <lst name="responseHeader"><int name="status">0</int><int name="QTime">1</int><lst name="params"><str name="q">id:abcde</str></lst></lst><result name="response" numFound="1" start="0"><doc><str name="id">abcde</str><str name="text">hello world</str><str name="text_hash">b169c743d220da8d</str><long name="_version_">1417248022215000064</long></doc></result>
      </response>
      

      Note the different text_hash value.

      Secondly, when updating a field other than those used to create the signature (which I imagine is a more common use-case), the signature is recalculated from no values:

      $ curl '$URL/update?commit=true' -H 'Content-type:application/json' -d '{"add":{"doc":{"id": "abcde", "title": {"set": "new title"}}}}' && curl '$URL/select?q=id:abcde'
      {"responseHeader":{"status":0,"QTime":39}}
      <?xml version="1.0" encoding="UTF-8"?>
      <response>
      <lst name="responseHeader"><int name="status">0</int><int name="QTime">1</int><lst name="params"><str name="q">id:abcde</str></lst></lst><result name="response" numFound="1" start="0"><doc><str name="id">abcde</str><str name="text">hello world</str><str name="text_hash">0000000000000000</str><str name="title">new title</str><long name="_version_">1417248120480202752</long></doc></result>
      </response>
      
      1. SOLR-4016.patch
        11 kB
        Shalin Shekhar Mangar
      2. SOLR-4016-disallow-partial-update.patch
        8 kB
        Shalin Shekhar Mangar
      3. SOLR-4016-disallow-partial-update.patch
        3 kB
        Shalin Shekhar Mangar

        Issue Links

          Activity

          Hide
          Shalin Shekhar Mangar added a comment -

          All these problems go away if we add the DistributedUpdateProcessorFactory ahead of all other processors instead of adding it just before RunUpdateProcessorFactory by default.

          Yonik Seeley, is there a reason why we add DUPF at the last? It seems that other processors such as RegexReplaceProcessorFactory will also be affected.

          Show
          Shalin Shekhar Mangar added a comment - All these problems go away if we add the DistributedUpdateProcessorFactory ahead of all other processors instead of adding it just before RunUpdateProcessorFactory by default. Yonik Seeley , is there a reason why we add DUPF at the last? It seems that other processors such as RegexReplaceProcessorFactory will also be affected.
          Hide
          Yonik Seeley added a comment -

          If the SignatureUpdateProcessorFactory is generating the unique id, it must come before the distributed code
          SOLR-2822 is a good starting place for related issues / discussions (I know there were more discussions but I can't find them now).

          Show
          Yonik Seeley added a comment - If the SignatureUpdateProcessorFactory is generating the unique id, it must come before the distributed code SOLR-2822 is a good starting place for related issues / discussions (I know there were more discussions but I can't find them now).
          Hide
          Shalin Shekhar Mangar added a comment -

          This patch expands the document before computing signature.

          I'm not convinced that it is the right solution. The DUPF gets the updated document in a synchronized (bucket) block which we don't. We could set the original document back (after adding signature to it) and let DUPF do its thing but that could lead to race conditions.

          Perhaps we should decouple the document expansion for partial updates from DistributedUpdateRequestProcessor and apply it at the start of the request so that all UpdateRequestProcessors can work on the full document.

          I don't fully comprehend the race conditions that may happen so I'll let someone more knowledgeable about this code to comment before proceeding any further.

          Show
          Shalin Shekhar Mangar added a comment - This patch expands the document before computing signature. I'm not convinced that it is the right solution. The DUPF gets the updated document in a synchronized (bucket) block which we don't. We could set the original document back (after adding signature to it) and let DUPF do its thing but that could lead to race conditions. Perhaps we should decouple the document expansion for partial updates from DistributedUpdateRequestProcessor and apply it at the start of the request so that all UpdateRequestProcessors can work on the full document. I don't fully comprehend the race conditions that may happen so I'll let someone more knowledgeable about this code to comment before proceeding any further.
          Hide
          Yonik Seeley added a comment -

          I think we should simply skip the "generate signature" logic of the processor if it's an atomic update.

          If one did want to do a full-blown solution, one way would probably involve using realtime-get to get the latest version of the document and then use optimistic concurrency when making the update. But I don't think it's that big of a restriction to say that atomic updates won't change the signature generated.

          Show
          Yonik Seeley added a comment - I think we should simply skip the "generate signature" logic of the processor if it's an atomic update. If one did want to do a full-blown solution, one way would probably involve using realtime-get to get the latest version of the document and then use optimistic concurrency when making the update. But I don't think it's that big of a restriction to say that atomic updates won't change the signature generated.
          Hide
          Shalin Shekhar Mangar added a comment -

          I think we should simply skip the "generate signature" logic of the processor if it's an atomic update.

          I see why you suggested that. The signature is like a unique key and modifying it seems like a rare use-case. But, if we do go that way, we should throw an exception and explicitly disallow partial update of signature generating fields. Otherwise you end up with documents containing a wrong signature.

          If one did want to do a full-blown solution, one way would probably involve using realtime-get to get the latest version of the document and then use optimistic concurrency when making the update.

          DUPF code says that it synchronizes on the bucket so that realtime-get works reliably. Is that necessary inside DedupUPF? If so, how do I get access to the VersionInfo? If not, is this patch an appropriate solution?

          Show
          Shalin Shekhar Mangar added a comment - I think we should simply skip the "generate signature" logic of the processor if it's an atomic update. I see why you suggested that. The signature is like a unique key and modifying it seems like a rare use-case. But, if we do go that way, we should throw an exception and explicitly disallow partial update of signature generating fields. Otherwise you end up with documents containing a wrong signature. If one did want to do a full-blown solution, one way would probably involve using realtime-get to get the latest version of the document and then use optimistic concurrency when making the update. DUPF code says that it synchronizes on the bucket so that realtime-get works reliably. Is that necessary inside DedupUPF? If so, how do I get access to the VersionInfo? If not, is this patch an appropriate solution?
          Hide
          Shalin Shekhar Mangar added a comment -

          Patch which disallows partial updates on signature generating fields

          Show
          Shalin Shekhar Mangar added a comment - Patch which disallows partial updates on signature generating fields
          Hide
          Shalin Shekhar Mangar added a comment -

          Patch with a better test.

          Show
          Shalin Shekhar Mangar added a comment - Patch with a better test.
          Hide
          Commit Tag Bot added a comment -

          [trunk commit] Shalin Shekhar Mangar
          http://svn.apache.org/viewvc?view=revision&revision=1433013

          SOLR-4016: Deduplication does not work with atomic/partial updates so disallow atomic update requests which change signature generating fields.

          Show
          Commit Tag Bot added a comment - [trunk commit] Shalin Shekhar Mangar http://svn.apache.org/viewvc?view=revision&revision=1433013 SOLR-4016 : Deduplication does not work with atomic/partial updates so disallow atomic update requests which change signature generating fields.
          Hide
          Yonik Seeley added a comment -

          I see why you suggested that. The signature is like a unique key and modifying it seems like a rare use-case. But, if we do go that way, we should throw an exception and explicitly disallow partial update of signature generating fields.

          There are different use-cases here. If the signature being generated was the unique key, then atomic updates should be able to proceed fine as long as the id field is specified (as should always be the case with atomic updates).

          Show
          Yonik Seeley added a comment - I see why you suggested that. The signature is like a unique key and modifying it seems like a rare use-case. But, if we do go that way, we should throw an exception and explicitly disallow partial update of signature generating fields. There are different use-cases here. If the signature being generated was the unique key, then atomic updates should be able to proceed fine as long as the id field is specified (as should always be the case with atomic updates).
          Hide
          Shalin Shekhar Mangar added a comment -

          If the signature being generated was the unique key, then atomic updates should be able to proceed fine as long as the id field is specified (as should always be the case with atomic updates).

          The patch that I committed throws an exception if an atomic update request contains fields that are used to compute the signature. An atomic update request which does not modify the signature, proceeds as normal. This way we make sure that a document never contains a wrong signature.

          Do you agree that this is an acceptable compromise until a proper fix is in place?

          Show
          Shalin Shekhar Mangar added a comment - If the signature being generated was the unique key, then atomic updates should be able to proceed fine as long as the id field is specified (as should always be the case with atomic updates). The patch that I committed throws an exception if an atomic update request contains fields that are used to compute the signature. An atomic update request which does not modify the signature, proceeds as normal. This way we make sure that a document never contains a wrong signature. Do you agree that this is an acceptable compromise until a proper fix is in place?
          Hide
          Commit Tag Bot added a comment -

          [branch_4x commit] Shalin Shekhar Mangar
          http://svn.apache.org/viewvc?view=revision&revision=1433530

          SOLR-4016: Deduplication does not work with atomic/partial updates so disallow atomic update requests which change signature generating fields.

          Show
          Commit Tag Bot added a comment - [branch_4x commit] Shalin Shekhar Mangar http://svn.apache.org/viewvc?view=revision&revision=1433530 SOLR-4016 : Deduplication does not work with atomic/partial updates so disallow atomic update requests which change signature generating fields.
          Hide
          Steve Rowe added a comment -

          Shalin, this can be resolved as fixed, right?

          Show
          Steve Rowe added a comment - Shalin, this can be resolved as fixed, right?
          Hide
          Shalin Shekhar Mangar added a comment -

          Fixed in 4.1 and trunk.

          Thanks Joel and Yonik.

          Show
          Shalin Shekhar Mangar added a comment - Fixed in 4.1 and trunk. Thanks Joel and Yonik.

            People

            • Assignee:
              Shalin Shekhar Mangar
              Reporter:
              Joel Nothman
            • Votes:
              1 Vote for this issue
              Watchers:
              4 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development