i'm not entirely sure i'm understanding the problems. here's what i think i understand...
1) if you put dedup prior to distrib, then regardless of how it is configured it currently runs twice, which is bad - this seems like it is solved by
2) if you want to use dedup to generate a sig for the uniqueKey field, then it really has to come before distrib, otherwise forwarding to the leader just won't work. (again: SOLR-2822 should make this do-able)
3) if you want to use dedup to generate a sig field that is not the uniqueKey field, AND you want to use "overwriteDupes=true" then (currently) this needs to happen after distrib, because otherwise the info about the deletion - tracked in AddUpdateCommand.updateTerm - is lost when distrib does the forward. This seems like something that the distrib processor should deal with by ensuring it serializes/deserializes all of the key information in the AddUpdateCommand when sending/receiving a TOLEADER/FROMLEADER request (using
3a) it's not enough to ensure that the "updateTerm" is forwarded to all the replicas in the shard, because other docs in other shards may have the same term value for the hash. (hence Markus's suggestions about doing a deleteByQuery - this should be in distribUP when AddUpdateCommand.updateTerm is non-null)
4) something about document cloning ... i still don't really understand this - not just in terms of dedup, but in general i don't really understand why SOLR-3215 is an issue assuming we fix SOLR-2822.
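
to make sure i'm describing 3 and 3a correctly, here's a toy model of what i mean. none of this is real Solr code: DedupSketch, AddCmd, Shard and Cluster are made-up names, routing is a plain hash of the id, and "updateTerm" is just a string standing in for AddUpdateCommand.updateTerm. the only point is that when the term is non-null, the delete has to reach every shard (the deleteByQuery idea), not just the shard that owns the new doc.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical, simplified model; none of these classes are Solr APIs.
class DedupSketch {

    // Stand-in for an add command: doc id, signature field value, and an
    // optional updateTerm (null means "do not overwrite duplicates").
    static final class AddCmd {
        final String id;
        final String sig;
        final String updateTerm;

        AddCmd(String id, String sig, String updateTerm) {
            this.id = id;
            this.sig = sig;
            this.updateTerm = updateTerm;
        }
    }

    // One shard, modeled as a map of doc id -> signature value.
    static final class Shard {
        final Map<String, String> docs = new HashMap<>();

        // Remove every doc on this shard whose signature equals the term.
        void deleteBySig(String term) {
            docs.values().removeIf(term::equals);
        }
    }

    static final class Cluster {
        final List<Shard> shards = new ArrayList<>();

        Cluster(int numShards) {
            for (int i = 0; i < numShards; i++) {
                shards.add(new Shard());
            }
        }

        // Assumed routing: hash of the id, NOT of the signature.
        Shard shardFor(String id) {
            return shards.get(Math.floorMod(id.hashCode(), shards.size()));
        }

        // Sketch of the distrib step: if the command carries an updateTerm,
        // delete matching docs on ALL shards, then add to the owning shard.
        void add(AddCmd cmd) {
            if (cmd.updateTerm != null) {
                for (Shard s : shards) {
                    s.deleteBySig(cmd.updateTerm);
                }
            }
            shardFor(cmd.id).docs.put(cmd.id, cmd.sig);
        }

        int totalDocs() {
            int n = 0;
            for (Shard s : shards) {
                n += s.docs.size();
            }
            return n;
        }
    }

    public static void main(String[] args) {
        // Ids "a" and "b" route to different shards in a 2-shard cluster.
        Cluster withTerm = new Cluster(2);
        withTerm.add(new AddCmd("a", "h1", "h1"));
        withTerm.add(new AddCmd("b", "h1", "h1"));
        System.out.println(withTerm.totalDocs()); // prints 1

        // If the term is lost on forward (null), the duplicate survives.
        Cluster termLost = new Cluster(2);
        termLost.add(new AddCmd("a", "h1", null));
        termLost.add(new AddCmd("b", "h1", null));
        System.out.println(termLost.totalDocs()); // prints 2
    }
}
```

the second case is what i understand the current behavior to be when dedup runs before distrib: the term never makes it across the forward, so nothing gets deleted. the first case only works because the delete is broadcast; deleting only on the shard that owns "b" would leave "a" in place.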