SOLR-3473

Distributed deduplication broken

    Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 4.0-ALPHA
    • Fix Version/s: 4.8
    • Component/s: SolrCloud, update
    • Labels: None

      Description

      Solr's deduplication via the SignatureUpdateProcessor is broken for distributed updates on SolrCloud.

      Mark Miller:

      Looking again at the SignatureUpdateProcessor code, I think that indeed this won't currently work with distrib updates. Could you file a JIRA issue for that? The problem is that we convert update commands into solr documents - and that can cause a loss of info if an update proc modifies the update command.

      I think the reason that you see a multiple values error when you try the other order is because of the lack of a document clone (the other issue I mentioned a few emails back). Addressing that won't solve your issue though - we have to come up with a way to propagate the currently lost info on the update command.

      Please see the ML thread for the full discussion: http://lucene.472066.n3.nabble.com/SolrCloud-deduplication-td3984657.html
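
      For context, the "currently lost info" is the deletion Term that the processor attaches to the in-JVM command when overwriteDupes=true. A paraphrased sketch of the 4.x processor (approximate, not the verbatim source; config plumbing omitted and the hex encoding simplified):

      import java.io.IOException;
      import org.apache.lucene.index.Term;
      import org.apache.solr.common.SolrInputDocument;
      import org.apache.solr.common.SolrInputField;
      import org.apache.solr.update.AddUpdateCommand;
      import org.apache.solr.update.processor.Lookup3Signature;
      import org.apache.solr.update.processor.Signature;
      import org.apache.solr.update.processor.UpdateRequestProcessor;

      // Paraphrased sketch of SignatureUpdateProcessor.processAdd (4.x): hash the
      // configured fields, store the hex signature in the signature field, and
      // (with overwriteDupes) record the pending deletion as cmd.updateTerm.
      class SignatureSketchProcessor extends UpdateRequestProcessor {
        private final String signatureField = "sig";            // from config
        private final String[] sigFields = {"name", "content"}; // from config
        private final boolean overwriteDupes = true;            // from config

        SignatureSketchProcessor(UpdateRequestProcessor next) { super(next); }

        @Override
        public void processAdd(AddUpdateCommand cmd) throws IOException {
          SolrInputDocument doc = cmd.getSolrInputDocument();

          Signature sig = new Lookup3Signature(); // configurable signatureClass
          for (String field : sigFields) {
            SolrInputField f = doc.getField(field);
            if (f != null) sig.add(String.valueOf(f.getValue()));
          }
          String sigString = toHex(sig.getSignature());
          doc.addField(signatureField, sigString);

          if (overwriteDupes) {
            // The lost info: this Term drives the duplicate deletion in the
            // local update handler, but it is never serialized when the distrib
            // processor forwards the command as a plain document.
            cmd.updateTerm = new Term(signatureField, sigString);
          }
          super.processAdd(cmd);
        }

        private static String toHex(byte[] bytes) {
          StringBuilder sb = new StringBuilder(bytes.length << 1);
          for (byte b : bytes) sb.append(String.format("%02x", b));
          return sb.toString();
        }
      }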

      1. SOLR-3473-trunk-2.patch
        9 kB
        Markus Jelsma
      2. SOLR-3473.patch
        9 kB
        Hoss Man
      3. SOLR-3473.patch
        3 kB
        Hoss Man


          Activity

          Steve Rowe added a comment -

          Bulk move 4.4 issues to 4.5 and 5.0

          Ken Ip added a comment -

          Hoss, any interest in bringing this patch forward to trunk?

          Robert Muir added a comment -

          Unassigned issues -> 4.1

          Lance Norskog added a comment -

          It would be great to have this work in some form, even if it does not have the same API as before.

          Robert Muir added a comment -

          rmuir20120906-bulk-40-change

          Markus Jelsma added a comment -

          Hello - Could the deleteByQuery issue you mention be fixed with SOLR-3473? I've attached an updated patch for today's trunk. The previous patch was missing the signature field, but I added it to one schema. Now other tests seem to fail because they don't see the sig field but do use the update chain.

          Anyway, it seems the BasicDistributedZkTest passes, but I'm not entirely sure: there's too much log output. At least it doesn't fail.

          Hoss Man added a comment -

          Bulk fixing the version info for 4.0-ALPHA and 4.0; all affected issues have "hoss20120711-bulk-40-change" in a comment.

          Hoss Man added a comment -

          Updated patch to include my (meager) attempt at fixing the problem by making processAdd immediately execute a deleteByQuery if the add includes an updateTerm.

          I banged my head against a bunch of version mismatch errors to get the patch into its current state, in which all the updates succeed but the query assertions in the test still fail, indicating that docs with duplicate signatures are making it into the index.

          On the up side: far fewer duplicates are making it into the index now than before the patch (when docs would only be deleted from the node that got the initial request, and then only if that node happened to be a shard leader)...

          wrong number of deduped docs (added 68 total) expected:<7> but was:<10>

          wrong number of deduped docs (added 71 total) expected:<7> but was:<8>

          wrong number of deduped docs (added 70 total) expected:<7> but was:<9>

          ...so apparently there is still some tiny corner-case code path where dups are sneaking in (either that, or the existing deleteByQuery code isn't reliable).

          I'm fairly certain I'm out of my depth at this point.

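          A hypothetical sketch of the approach described above (this is not the attached patch): inside the distributed processor, a pending updateTerm becomes an explicit deleteByQuery, since a DBQ fans out to every shard while the updateTerm does not.

          import java.io.IOException;
          import org.apache.lucene.index.Term;
          import org.apache.solr.update.AddUpdateCommand;
          import org.apache.solr.update.DeleteUpdateCommand;
          import org.apache.solr.update.processor.UpdateRequestProcessor;

          // Hypothetical sketch, not the attached patch: turn the dedup deletion
          // into a distributed deleteByQuery before the add proceeds.
          class DeleteDupesSketch extends UpdateRequestProcessor {
            DeleteDupesSketch(UpdateRequestProcessor next) { super(next); }

            @Override
            public void processAdd(AddUpdateCommand cmd) throws IOException {
              Term dupeTerm = cmd.updateTerm;
              if (dupeTerm != null) {
                DeleteUpdateCommand del = new DeleteUpdateCommand(cmd.getReq());
                del.setQuery(dupeTerm.field() + ":\"" + dupeTerm.text() + "\"");
                super.processDelete(del); // DBQ is forwarded to all shards/replicas
                cmd.updateTerm = null;    // handled; avoid a second, local-only delete
              }
              super.processAdd(cmd);
            }
          }

          Note the delete and the add remain two separately versioned operations, which is presumably where the version mismatch errors and the remaining corner-case duplicates come from.
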
          Hoss Man added a comment -

          Test demonstrating the problem.

          Where SignatureUpdateProcessorFactory lives in the chain doesn't affect the outcome.

          Hoss Man added a comment -

          I'm not entirely sure I'm understanding the problems. Here's what I think I understand...

          1) If you put dedup prior to distrib, then regardless of how it is configured it currently runs twice, which is bad - this seems like it is solved by SOLR-2822.

          2) If you want to use dedup to generate a sig for the uniqueKey field, then it really has to come before distrib, otherwise forwarding to the leader just won't work. (Again: SOLR-2822 should make this do-able.)

          3) If you want to use dedup to generate a sig field that is not the uniqueKey field, AND you want to use "overwriteDupes=true", then (currently) this needs to happen after distrib, because otherwise the info about the deletion - tracked in AddUpdateCommand.updateTerm - is lost when distrib does the forward. This seems like something the distrib processor should deal with by ensuring it serializes/deserializes all of the key information in the AddUpdateCommand when sending/receiving a TOLEADER/FROMLEADER request (using SOLR-2822 vernacular); see the sketch after this comment.

          3a) It's not enough to ensure that the "updateTerm" is forwarded to all the replicas in the shard, because other docs in other shards may have the same term value for the hash. (Hence Markus's suggestion about doing a deleteByQuery - this should happen in distribUP when AddUpdateCommand.updateTerm is non-null.)

          4) Something about document cloning... I still don't really understand this - not just in terms of dedup, but in general I don't really understand why SOLR-3215 is an issue assuming we fix SOLR-2822.

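          A minimal illustration of the serialization idea in point 3, assuming the term were simply carried as a pair of request parameters on the TOLEADER/FROMLEADER forward. The helper class and parameter names are invented for this sketch; no such params exist in Solr:

          import org.apache.lucene.index.Term;
          import org.apache.solr.common.params.ModifiableSolrParams;
          import org.apache.solr.common.params.SolrParams;

          // Hypothetical helper: round-trip AddUpdateCommand.updateTerm through
          // request params on a forwarded update. Param names are made up here.
          final class UpdateTermParams {
            private static final String FIELD = "dedup.term.field";
            private static final String TEXT  = "dedup.term.text";

            // Called on the forwarding side, before serializing the request.
            static void write(Term updateTerm, ModifiableSolrParams params) {
              if (updateTerm != null) {
                params.set(FIELD, updateTerm.field());
                params.set(TEXT, updateTerm.text());
              }
            }

            // Called on the receiving side, to restore cmd.updateTerm.
            static Term read(SolrParams params) {
              String field = params.get(FIELD);
              return (field == null) ? null : new Term(field, params.get(TEXT));
            }
          }

          Per point 3a, restoring the term alone is still not enough: the leader would additionally have to translate it into a deleteByQuery so the deletion reaches the other shards.
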
          Mark Miller added a comment -

          > To work around the problem of having the digest field as ID, could it not simply issue a deleteByQuery for the digest prior to adding it? Would that cause significant overhead for very large systems with many updates?

          Yeah, that might be an option - I don't know that it will be great perf-wise, or race-airtight-wise, but it may be a viable option.

          > We would, from Nutch's point of view, certainly want to avoid changing the ID from URL to digest.

          Ah, interesting. If you are enforcing uniqueness by digest though, is this really a problem? It would only have to be in the Solr world that the id was the digest - and you could even call it something else and have an id:url field as well. Just thinking out loud.

          Or, perhaps we could make it so you could pick the hash field? Then hash on digest. If you are using overwrite=true, this should work, right?

          Or perhaps someone else has some ideas...

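          For concreteness, a client-side version of the workaround discussed above, using the 4.x SolrJ API: delete by digest, then add. The core URL and the field names ("id", "digest") are assumptions for illustration, and as noted the pair of operations is not race-proof:

          import org.apache.solr.client.solrj.SolrServer;
          import org.apache.solr.client.solrj.impl.HttpSolrServer;
          import org.apache.solr.common.SolrInputDocument;

          // Client-side sketch: deleteByQuery on the digest before adding the doc,
          // instead of relying on overwriteDupes in the update chain.
          public class DedupeByDigest {
            public static void main(String[] args) throws Exception {
              SolrServer solr = new HttpSolrServer("http://localhost:8983/solr/collection1");

              SolrInputDocument doc = new SolrInputDocument();
              doc.addField("id", "http://example.com/page");              // URL as uniqueKey
              doc.addField("digest", "3f786850e387550fdab836ed7e6dc881"); // content hash

              // Not race-airtight: a concurrent add with the same digest can slip
              // in between the delete and the add.
              solr.deleteByQuery("digest:" + doc.getFieldValue("digest"));
              solr.add(doc);
              solr.commit();
            }
          }
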
          Markus Jelsma added a comment -

          That makes sense indeed.

          To work around the problem of having the digest field as ID, could it not simply issue a deleteByQuery for the digest prior to adding it? Would that cause significant overhead for very large systems with many updates?

          We would, from Nutch's point of view, certainly want to avoid changing the ID from URL to digest.

          Mark Miller added a comment -

          My next response on the ML:

          I take that back - I think that may be the only way to make this work well. We need that document clone, which will let you put the dedupe proc after the distrib proc. I think in general, the dedupe proc will only work if your signature field is the id field, though - otherwise, hash sharding that happens on the id field is going to cause a problem.

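          A minimal sketch of the "document clone" idea, assuming the forwarding path copied the whole command rather than just the document. SolrInputDocument.deepCopy() is real 4.x API; the helper itself is hypothetical:

          import org.apache.solr.common.SolrInputDocument;
          import org.apache.solr.request.SolrQueryRequest;
          import org.apache.solr.update.AddUpdateCommand;

          // Hypothetical helper: give the local chain its own deep copy of the doc
          // so a proc placed after distrib can mutate it safely, while preserving
          // the command-level dedup state (updateTerm) that a plain document
          // forward drops.
          final class CommandCloner {
            static AddUpdateCommand cloneCommand(AddUpdateCommand cmd, SolrQueryRequest req) {
              AddUpdateCommand copy = new AddUpdateCommand(req);
              copy.solrDoc = cmd.getSolrInputDocument().deepCopy();
              copy.updateTerm = cmd.updateTerm;   // otherwise lost on forwarding
              copy.overwrite = cmd.overwrite;
              copy.commitWithin = cmd.commitWithin;
              return copy;
            }
          }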

            People

            • Assignee: Unassigned
            • Reporter: Markus Jelsma
            • Votes: 3
            • Watchers: 9
