Solr
  1. Solr
  2. SOLR-799

Add support for hash based exact/near duplicate document handling

    Details

    • Type: New Feature New Feature
    • Status: Closed
    • Priority: Minor Minor
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 1.4
    • Component/s: update
    • Labels:
      None

      Description

      Hash based duplicate document detection is efficient and allows for blocking as well as field collapsing. Lets put it into solr.

      http://wiki.apache.org/solr/Deduplication

      1. SOLR-799.patch
        23 kB
        Mark Miller
      2. SOLR-799.patch
        36 kB
        Mark Miller
      3. SOLR-799.patch
        35 kB
        Mark Miller
      4. SOLR-799.patch
        33 kB
        Mark Miller

        Activity

        Hide
        Mark Miller added a comment -

        First pass for comments/reaction

        Show
        Mark Miller added a comment - First pass for comments/reaction
        Hide
        Andrzej Bialecki added a comment -

        Interesting development in light of NUTCH-442 Some comments:

        • in MD5Signature I suggest using the code from org.apache.hadoop.io.MD5Hash.toString() instead of BigInteger.
        • TextProfileSignature should contain a remark that it's copied from Nutch, since AFAIK the algorithm that it implements is currently used only in Nutch.
        • in Nutch the concept of a page Signature is only a part of the deduplication process. The other part is the algorithm to decide which copy to keep and which one to discard. In your patch the latest update always removes all other documents with the same signature. IMHO this decision should be isolated into a DuplicateDeletePolicy class that gets all duplicates and can decide (based on arbitrary criteria) which one to keep, with the default implementation that simply keeps the latest document.
        Show
        Andrzej Bialecki added a comment - Interesting development in light of NUTCH-442 Some comments: in MD5Signature I suggest using the code from org.apache.hadoop.io.MD5Hash.toString() instead of BigInteger. TextProfileSignature should contain a remark that it's copied from Nutch, since AFAIK the algorithm that it implements is currently used only in Nutch. in Nutch the concept of a page Signature is only a part of the deduplication process. The other part is the algorithm to decide which copy to keep and which one to discard. In your patch the latest update always removes all other documents with the same signature. IMHO this decision should be isolated into a DuplicateDeletePolicy class that gets all duplicates and can decide (based on arbitrary criteria) which one to keep, with the default implementation that simply keeps the latest document.
        Hide
        Mark Miller added a comment -

        Thanks for the review Andrzej. I've made the first two changes (I put at the top of TextProfileSignature that its 'borrowed' from Nutch and grabbed Hadoops MD5Hash class and stripped its Hadoop dependencies) and I'm investigating change 3. I'll put up another patch in a couple days.

        • Mark
        Show
        Mark Miller added a comment - Thanks for the review Andrzej. I've made the first two changes (I put at the top of TextProfileSignature that its 'borrowed' from Nutch and grabbed Hadoops MD5Hash class and stripped its Hadoop dependencies) and I'm investigating change 3. I'll put up another patch in a couple days. Mark
        Hide
        Grant Ingersoll added a comment -

        Haven't looked at the patch, but I agree that it is wise to separate the detection of duplication from the handling of found duplicates. The default can be to remove all as in the patch, but it should be easy to override. Scenarios I can see being useful:
        1. Prevent new insert
        2. Remove old (i.e. same as an update works now)
        3. Note the duplicate on the existing document in a "duplicates" field. This obviously requires either deleting and re-adding the doc, or Lucene to better support appending/updating fields, maybe via the column-stride payloads (if that ever happens). No need for this anytime soon.

        Show
        Grant Ingersoll added a comment - Haven't looked at the patch, but I agree that it is wise to separate the detection of duplication from the handling of found duplicates. The default can be to remove all as in the patch, but it should be easy to override. Scenarios I can see being useful: 1. Prevent new insert 2. Remove old (i.e. same as an update works now) 3. Note the duplicate on the existing document in a "duplicates" field. This obviously requires either deleting and re-adding the doc, or Lucene to better support appending/updating fields, maybe via the column-stride payloads (if that ever happens). No need for this anytime soon.
        Hide
        Yonik Seeley added a comment -

        I agree that it is wise to separate the detection of duplication from the handling of found duplicates

        Though in some implementations (like #2, which may be the default), detecting that duplicate and handling it are truly coupled... forcing a decoupling would not be a good thing in that case.

        Show
        Yonik Seeley added a comment - I agree that it is wise to separate the detection of duplication from the handling of found duplicates Though in some implementations (like #2, which may be the default), detecting that duplicate and handling it are truly coupled... forcing a decoupling would not be a good thing in that case.
        Hide
        Yonik Seeley added a comment -

        Some thoughts...

        • How should different "types" be handled (for example when we support binary fields). For example, different base64 encoders might use different line lengths or different line endings (CR/LF). Perhaps it's good enough to say that the string form must be identical, and leave it at that for now? The alternative would be signatures based on the Lucene Document about to be indexed.
        • It would be nice to be able to calculate a signature for a document w/o having to catenate all the fields together.
          Perhaps change calculate(String content) to something like calculate(Iterable<CharSequence> content)?

        An alternative option would be incremental hashing...

        Signature sig = ourSignatureCreator.create();
        sig.add(f1)
        sig.add(f2)
        sig.add(f3)
        String s = sig.getSignature()
        

        Looking at how TextProfileSignature works, i'd lean toward incremental hashing to avoid building yet another big string. Having a hashing object also opens up the possibility to easily add other method signatures for more efficient hashing.

        • It appears that if you put fields in a different order that the signature will change
        • It appears that documents with different field names but the same content will have the same signature.
        • I don't understand the dedup logic in DUH2... it seems like we want to delete by id and by sig... unfortunately there is no
          IndexWriter.updateDocument(Term[] terms, Document doc) so we'll have to do a separate non-atomic delete on the sig for now, right?
        • There's probably no need for a separate test solrconfig-deduplicate.xml if all it adds is an update processor. Tests could just explicitly specify the update handler on updates.
        Show
        Yonik Seeley added a comment - Some thoughts... How should different "types" be handled (for example when we support binary fields). For example, different base64 encoders might use different line lengths or different line endings (CR/LF). Perhaps it's good enough to say that the string form must be identical, and leave it at that for now? The alternative would be signatures based on the Lucene Document about to be indexed. It would be nice to be able to calculate a signature for a document w/o having to catenate all the fields together. Perhaps change calculate(String content) to something like calculate(Iterable<CharSequence> content)? An alternative option would be incremental hashing... Signature sig = ourSignatureCreator.create(); sig.add(f1) sig.add(f2) sig.add(f3) String s = sig.getSignature() Looking at how TextProfileSignature works, i'd lean toward incremental hashing to avoid building yet another big string. Having a hashing object also opens up the possibility to easily add other method signatures for more efficient hashing. It appears that if you put fields in a different order that the signature will change It appears that documents with different field names but the same content will have the same signature. I don't understand the dedup logic in DUH2... it seems like we want to delete by id and by sig... unfortunately there is no IndexWriter.updateDocument(Term[] terms, Document doc) so we'll have to do a separate non-atomic delete on the sig for now, right? There's probably no need for a separate test solrconfig-deduplicate.xml if all it adds is an update processor. Tests could just explicitly specify the update handler on updates.
        Hide
        Mark Miller added a comment -

        I agree that it is wise to separate the detection of duplication from the handling of found duplicates

        Though in some implementations (like #2, which may be the default), detecting that duplicate and handling it are truly coupled... forcing a decoupling would not be a good thing in that case.

        Still looking at this. Was hoping to avoid any of the old 'if solr crashes you can have 2 docs with same id in the index' type stuff. Guess I won't easily get away with that <g> Hopefully we can make it so the default implementation can still be as efficient and atomic.

        How should different "types" be handled (for example when we support binary fields). For example, different base64 encoders might use different line lengths or different line endings (CR/LF). Perhaps it's good enough to say that the string form must be identical, and leave it at that for now? The alternative would be signatures based on the Lucene Document about to be indexed.

        Yeah, may be best to worry about it when we support binary fields...would be nice to look forward though. I think returning a byte[] rather than a String will future proof the sig implementations a bit along those lines (though doesn't address your point)...still mulling - this shouldn't trip up Fuzzy hashing implementations to much, and so how exact should MD5Signature be...

        * It appears that if you put fields in a different order that the signature will change

        * It appears that documents with different field names but the same content will have the same signature.

        Two good points I have addressed.

        It would be nice to be able to calculate a signature for a document w/o having to catenate all the fields together.

        Perhaps change calculate(String content) to something like calculate(Iterable<CharSequence> content)?

        I like the idea of incremental as well.

        I don't understand the dedup logic in DUH2... it seems like we want to delete by id and by sig... unfortunately there is no

        IndexWriter.updateDocument(Term[] terms, Document doc) so we'll have to do a separate non-atomic delete on the sig for now, right?

        Another one I was hoping to get away with. My current strategy was to say that setting an update term means that updating by id is overridden and only the update Term is used - effectively, the update Term (signature) becomes the update id - and you can control whether the id factors into that update signature or not. Didn't get passes the goalie I suppose <g> I guess I give up on clean atomic imp and perhaps investigate update(terms[], doc) for the future. I wanted to deal with both signature and id, but figured its best to start with most efficient bare bones and work out.

        There's probably no need for a separate test solrconfig-deduplicate.xml if all it adds is an update processor. Tests could just explicitly specify the update handler on updates.

        Its mainly for me at the moment (testing config settings loading and what not), I'll be sure to pull it once the patch is done.

        Thanks for all of the feedback.

        Show
        Mark Miller added a comment - I agree that it is wise to separate the detection of duplication from the handling of found duplicates Though in some implementations (like #2, which may be the default), detecting that duplicate and handling it are truly coupled... forcing a decoupling would not be a good thing in that case. Still looking at this. Was hoping to avoid any of the old 'if solr crashes you can have 2 docs with same id in the index' type stuff. Guess I won't easily get away with that <g> Hopefully we can make it so the default implementation can still be as efficient and atomic. How should different "types" be handled (for example when we support binary fields). For example, different base64 encoders might use different line lengths or different line endings (CR/LF). Perhaps it's good enough to say that the string form must be identical, and leave it at that for now? The alternative would be signatures based on the Lucene Document about to be indexed. Yeah, may be best to worry about it when we support binary fields...would be nice to look forward though. I think returning a byte[] rather than a String will future proof the sig implementations a bit along those lines (though doesn't address your point)...still mulling - this shouldn't trip up Fuzzy hashing implementations to much, and so how exact should MD5Signature be... * It appears that if you put fields in a different order that the signature will change * It appears that documents with different field names but the same content will have the same signature. Two good points I have addressed. It would be nice to be able to calculate a signature for a document w/o having to catenate all the fields together. Perhaps change calculate(String content) to something like calculate(Iterable<CharSequence> content)? I like the idea of incremental as well. I don't understand the dedup logic in DUH2... it seems like we want to delete by id and by sig... unfortunately there is no IndexWriter.updateDocument(Term[] terms, Document doc) so we'll have to do a separate non-atomic delete on the sig for now, right? Another one I was hoping to get away with. My current strategy was to say that setting an update term means that updating by id is overridden and only the update Term is used - effectively, the update Term (signature) becomes the update id - and you can control whether the id factors into that update signature or not. Didn't get passes the goalie I suppose <g> I guess I give up on clean atomic imp and perhaps investigate update(terms[], doc) for the future. I wanted to deal with both signature and id, but figured its best to start with most efficient bare bones and work out. There's probably no need for a separate test solrconfig-deduplicate.xml if all it adds is an update processor. Tests could just explicitly specify the update handler on updates. Its mainly for me at the moment (testing config settings loading and what not), I'll be sure to pull it once the patch is done. Thanks for all of the feedback.
        Hide
        Andrzej Bialecki added a comment -

        +1 on the incremental sig calculation.

        Re: different "types" of signatures. Our experience in Nutch is that signature type is rarely changed, and we assume that this setting is selected once per lifetime of an index, i.e. there are never any mixed cases of documents with incompatible signatures. If we want to be sure that they are comparable, we could prepend a byte or two of unique signature type id - this way, even if a signature value matches but was calculated using other impl. the documents won't be considered duplicates, which is the way it should work, because different signature algorithms are incomparable.

        Re: signature as byte[] - I think it's better if we return byte[] from Signature, and until we support binary fields we just turn this into a hex string.

        Re: field ordering in DeduplicateUpdateProcessorFactory: I think that both sigFields (if defined) and any other document fields (if sigFields is undefined) should be first ordered in a predictable way (lexicographic?). Current patch uses a HashSet which doesn't guarantee any particular ordering - in fact the ordering may be different if you run the same code under different JVMs, which may introduce a random factor to the sig. calculation.

        Show
        Andrzej Bialecki added a comment - +1 on the incremental sig calculation. Re: different "types" of signatures. Our experience in Nutch is that signature type is rarely changed, and we assume that this setting is selected once per lifetime of an index, i.e. there are never any mixed cases of documents with incompatible signatures. If we want to be sure that they are comparable, we could prepend a byte or two of unique signature type id - this way, even if a signature value matches but was calculated using other impl. the documents won't be considered duplicates, which is the way it should work, because different signature algorithms are incomparable. Re: signature as byte[] - I think it's better if we return byte[] from Signature, and until we support binary fields we just turn this into a hex string. Re: field ordering in DeduplicateUpdateProcessorFactory: I think that both sigFields (if defined) and any other document fields (if sigFields is undefined) should be first ordered in a predictable way (lexicographic?). Current patch uses a HashSet which doesn't guarantee any particular ordering - in fact the ordering may be different if you run the same code under different JVMs, which may introduce a random factor to the sig. calculation.
        Hide
        Hoss Man added a comment -

        (disclaimer: haven't looked at the patch)

        Though in some implementations (like #2, which may be the default), detecting that duplicate and handling it are truly coupled... forcing a decoupling would not be a good thing in that case.

        I don't follow your reasoning. all the usecases i've seen mentioned seem like they could/would decouple very nicely...

        1. Prevent new insert – SignatureUpdateProcessor generates a signature and adds it as a field; AbortIfExistingUpdateProcessor aborts the update if a doc exists with a specific field in common with the doc to be added.
        2. Remove old (i.e. same as an update works now) – SignatureUpdateProcessor as mentioned before, and signature field is used as the uniqueKey field.
        3. Note the duplicate on the existing document in a "duplicates" field – SignatureUpdateProcessor as mentioned before; AnnotateDuplicatesProcessor checks for existing docs with a specific field in common with the doc to be added and executes additional opperations to "udpate" those docs, as well as the doc to be added.

        Show
        Hoss Man added a comment - (disclaimer: haven't looked at the patch) Though in some implementations (like #2, which may be the default), detecting that duplicate and handling it are truly coupled... forcing a decoupling would not be a good thing in that case. I don't follow your reasoning. all the usecases i've seen mentioned seem like they could/would decouple very nicely... 1. Prevent new insert – SignatureUpdateProcessor generates a signature and adds it as a field; AbortIfExistingUpdateProcessor aborts the update if a doc exists with a specific field in common with the doc to be added. 2. Remove old (i.e. same as an update works now) – SignatureUpdateProcessor as mentioned before, and signature field is used as the uniqueKey field. 3. Note the duplicate on the existing document in a "duplicates" field – SignatureUpdateProcessor as mentioned before; AnnotateDuplicatesProcessor checks for existing docs with a specific field in common with the doc to be added and executes additional opperations to "udpate" those docs, as well as the doc to be added.
        Hide
        Hoss Man added a comment -

        some misc comments from a user perspective based on the current state of the wiki...

        1) rather then a comma seperated <str> fields, we should just use an <arr>

        2) we should consider if/how we want to support using dynamicFields (ie: field name globs) in listing fields that are included in the signature)

        3) "By default, all non null fields on the document will be used." ... there's no such thing as a null field – there are fields that have no value, and there are fields whose value is an empty string, but no null value.

        4) yonik already asked other questions i had based on the wiki: how the order of fields in the update command affects the signature that gets computed – both in terms of fields with different names, and fields with the same name. the fields should probably be stable sorted by field name, so that the order of fields with teh same name affects the signature, but the relative order of fields with different names doesn't (since the order of fields with the same name actually affects the way the document is indexed, but the order of different field names does not)

        Show
        Hoss Man added a comment - some misc comments from a user perspective based on the current state of the wiki... 1) rather then a comma seperated <str> fields, we should just use an <arr> 2) we should consider if/how we want to support using dynamicFields (ie: field name globs) in listing fields that are included in the signature) 3) "By default, all non null fields on the document will be used." ... there's no such thing as a null field – there are fields that have no value, and there are fields whose value is an empty string, but no null value. 4) yonik already asked other questions i had based on the wiki: how the order of fields in the update command affects the signature that gets computed – both in terms of fields with different names, and fields with the same name. the fields should probably be stable sorted by field name, so that the order of fields with teh same name affects the signature, but the relative order of fields with different names doesn't (since the order of fields with the same name actually affects the way the document is indexed, but the order of different field names does not)
        Hide
        Mark Miller added a comment -

        1. Prevent new insert - SignatureUpdateProcessor generates a signature and adds it as a field; AbortIfExistingUpdateProcessor aborts the update if a doc exists with a specific field in common with the doc to be added.

        I like the idea of using UpdateProcessors for all of this. Its very clean compared to hacking around the DirectUpdateHandler. Unfortunately, I think AbortIfExistingUpdateProcessor would require locks that are too course. Ideally, you want to be able to inject code into the DirectUpdateHandlers 3 levels of locking (iw,sync(this),none). Thats whats needed for efficiency, but the cleanness gets whacked - its ugly to get that done, and doesn't really mesh with the UpdateHandler API thats been defined. The linking of DirectUpdateHandlers2's addDoc implementation to the whole idea...there would have to be changes that just don't seem worth the added functionality.

        Which leaves just hardcoding the support into DirectUpdateHandler, kind of like was done before for deletes/id dupes, and then just give options on the add doc cmd. Again I don't like it. But the anything else quickly breaks down for me. Any suggestions, insights?

        Show
        Mark Miller added a comment - 1. Prevent new insert - SignatureUpdateProcessor generates a signature and adds it as a field; AbortIfExistingUpdateProcessor aborts the update if a doc exists with a specific field in common with the doc to be added. I like the idea of using UpdateProcessors for all of this. Its very clean compared to hacking around the DirectUpdateHandler. Unfortunately, I think AbortIfExistingUpdateProcessor would require locks that are too course. Ideally, you want to be able to inject code into the DirectUpdateHandlers 3 levels of locking (iw,sync(this),none). Thats whats needed for efficiency, but the cleanness gets whacked - its ugly to get that done, and doesn't really mesh with the UpdateHandler API thats been defined. The linking of DirectUpdateHandlers2's addDoc implementation to the whole idea...there would have to be changes that just don't seem worth the added functionality. Which leaves just hardcoding the support into DirectUpdateHandler, kind of like was done before for deletes/id dupes, and then just give options on the add doc cmd. Again I don't like it. But the anything else quickly breaks down for me. Any suggestions, insights?
        Hide
        Hoss Man added a comment -

        If we assume for a minute that users who want to prevent or overwrite duplicates using a signature should always use the signature field as their uniqueKey, then doesn't use case#1 simplify to just running using a SignatureUpdateProcessor and then another processor that forces "allowDups=false,overwritePending=false,overwriteCommitted=false" ?

        Conceptually that seems right ... but at the moment DIH2 doesn't seem to care about allowDups at all (it only looks at overwriteCommitted and overwritePending to decide if it's allowing duplicates) and i'm not sure how to make it work off the top of my head, but assuming we need to muck with DIH2 internals in some way to make signatures (and aborting if the signature already exists) work, implementing the changes to happen for those combination of existing options seems like the cleanest approach.: the functional changes to DIH2 become generally useful to anyone who doesn't want to overwrite existing docs with the same id, regardless of whether they are computing a signature.

        the only hangup is whether we're okay with the initial assumption: that users who want duplicate detection by signature are willing to use the signature as the uniqueKey. If not then perhaps the cleanest way to support that would be to add more generalized "unique field" support ... a list of field names in the schema.xml and a (hopefully) simple call writer.deleteDocuments(Term[]) call in DIH2 should do the trick right? ... this could also be potentially useful to people for other purposes besides signatures, but i haven't thought throw all the permutations so i'm sure there would be funky corner cases.

        Show
        Hoss Man added a comment - If we assume for a minute that users who want to prevent or overwrite duplicates using a signature should always use the signature field as their uniqueKey, then doesn't use case#1 simplify to just running using a SignatureUpdateProcessor and then another processor that forces "allowDups=false,overwritePending=false,overwriteCommitted=false" ? Conceptually that seems right ... but at the moment DIH2 doesn't seem to care about allowDups at all (it only looks at overwriteCommitted and overwritePending to decide if it's allowing duplicates) and i'm not sure how to make it work off the top of my head, but assuming we need to muck with DIH2 internals in some way to make signatures (and aborting if the signature already exists) work, implementing the changes to happen for those combination of existing options seems like the cleanest approach.: the functional changes to DIH2 become generally useful to anyone who doesn't want to overwrite existing docs with the same id, regardless of whether they are computing a signature. the only hangup is whether we're okay with the initial assumption: that users who want duplicate detection by signature are willing to use the signature as the uniqueKey. If not then perhaps the cleanest way to support that would be to add more generalized "unique field" support ... a list of field names in the schema.xml and a (hopefully) simple call writer.deleteDocuments(Term[]) call in DIH2 should do the trick right? ... this could also be potentially useful to people for other purposes besides signatures, but i haven't thought throw all the permutations so i'm sure there would be funky corner cases.
        Hide
        Otis Gospodnetic added a comment -

        Haven't looked at the patch yet.
        Have looked at the Deduplication wiki page (and realize the stuff I'll write below is briefly mentioned there).
        Have skimmed the above comments.

        I want to bring up the use case that seems to have been mentioned already, but only in passing. The focus of the previous comments seems to be on index-time duplication detection. Another huge use case is search-time near-duplicate detection. Sometimes it's about straight forward field collapsing (collapsing adjacent docs with identical values in some field), but sometimes it's more complicated.

        For example, sometimes multiple fields need to be compared. Sometimes they have to be identical for collapsing to happen. Sometimes they only need to be "similar". How similarity is calculated is very application-dependent. I believe this similarity computation has to be completely open/extensible/overridable, allowing one to write a custom search component, examine returned hits and compare them using app-specific similarity....

        Ideally one would have the option not to save the document/field at index-time (for examination at search-time), since that prevents one from experimenting and dynamically changing the similarity computation.

        Here is one example. Imagine a field called "IDs" that can have 1 or more tokens in it and imagine docs with the following "IDs" get returned:

        1) id:aaa
        2) id:bbb
        3) id:ccc ddd
        4) id:aaa bbb
        5) id:eee ddd
        6) id:aaa

        A custom similarity may look at all of the above (e.g. a page's worth of hits) and decide that:
        1) and 4) are similar
        2) and 4) are also similar
        3) and 5) are similar
        1) and 4) and 6) are similar

        Another custom similarity may say that only 1) and 6) are similar because they are identical.

        My point is really that we have to leave it up to the application to provide similarity implementation, just like we make it possible for the app to provide a custom Lucene Similarity.

        Is the goal of this issue to make this possible?

        Show
        Otis Gospodnetic added a comment - Haven't looked at the patch yet. Have looked at the Deduplication wiki page (and realize the stuff I'll write below is briefly mentioned there). Have skimmed the above comments. I want to bring up the use case that seems to have been mentioned already, but only in passing. The focus of the previous comments seems to be on index-time duplication detection. Another huge use case is search-time near-duplicate detection. Sometimes it's about straight forward field collapsing (collapsing adjacent docs with identical values in some field), but sometimes it's more complicated. For example, sometimes multiple fields need to be compared. Sometimes they have to be identical for collapsing to happen. Sometimes they only need to be "similar". How similarity is calculated is very application-dependent. I believe this similarity computation has to be completely open/extensible/overridable, allowing one to write a custom search component, examine returned hits and compare them using app-specific similarity.... Ideally one would have the option not to save the document/field at index-time (for examination at search-time), since that prevents one from experimenting and dynamically changing the similarity computation. Here is one example. Imagine a field called "IDs" that can have 1 or more tokens in it and imagine docs with the following "IDs" get returned: 1) id:aaa 2) id:bbb 3) id:ccc ddd 4) id:aaa bbb 5) id:eee ddd 6) id:aaa A custom similarity may look at all of the above (e.g. a page's worth of hits) and decide that: 1) and 4) are similar 2) and 4) are also similar 3) and 5) are similar 1) and 4) and 6) are similar Another custom similarity may say that only 1) and 6) are similar because they are identical. My point is really that we have to leave it up to the application to provide similarity implementation, just like we make it possible for the app to provide a custom Lucene Similarity. Is the goal of this issue to make this possible?
        Hide
        Yonik Seeley added a comment -

        "overwriting" is implemented and supported in Lucene now (and we gain a number of benefits from using that). Conditionally adding a document, or testing if a document already exists, is not supported.

        Since we can't currently determine if something is a duplicate, it seems like this issue should go ahead with just a single option: whether to remove older documents with the same signature or not.

        Show
        Yonik Seeley added a comment - "overwriting" is implemented and supported in Lucene now (and we gain a number of benefits from using that). Conditionally adding a document, or testing if a document already exists, is not supported. Since we can't currently determine if something is a duplicate, it seems like this issue should go ahead with just a single option: whether to remove older documents with the same signature or not.
        Hide
        Yonik Seeley added a comment -

        Otis: this issue only handles the index side of things. The signature generating class is pluggable. Is there anything else needed on the indexing side?

        Show
        Yonik Seeley added a comment - Otis: this issue only handles the index side of things. The signature generating class is pluggable. Is there anything else needed on the indexing side?
        Hide
        Otis Gospodnetic added a comment -

        Thanks Yonik. Good thing I asked for the clarification, since Marks' issue description does mention search-time stuff (field collapsing).

        Mark: Do you still plan on tackling search-time duplicate/near-duplicate/similar doc detection? In a separate issue? Thanks.

        Show
        Otis Gospodnetic added a comment - Thanks Yonik. Good thing I asked for the clarification, since Marks' issue description does mention search-time stuff (field collapsing). Mark: Do you still plan on tackling search-time duplicate/near-duplicate/similar doc detection? In a separate issue? Thanks.
        Hide
        Mark Miller added a comment - - edited

        I find the pluggable replace/prevent/append policy idea appealing, but I have not yet found a great way to plug it into the UpdateHandler. Any approach other than sub-classing DirectUpdateHandler2 appears to lead to tying an IndexWriter to UpdateHandler. There is a connection now, UpdateHandler has a method to create a main IndexWriter, but further tying seems wrong without a stronger reason. That point is arguable, but in the end, sub-classing results in simpler code in any case. The trade off is that now you have a PreventDupesDirectUpdateHandler that extends DirectUpdateHandler2. This would have to be used in combination with the SignatureUpdateProcessor if you want to prevent dupes from entering the index. Other use cases (other than overwriting) would require another UpdateHandler. Less than ideal in both cases (subclassing, pluggable interface/class).

        Both approaches lead to less than ideal solutions beyond that as well . Because many docs that have been added to Solr might not yet be visible to an IndexReader, you have to keep a pending commit set of docs to check against. This list should be resilient against AddDoc, DelDocWquery, AddDoc, Commit. You'd essentially have to keep a mini index around to search against to accomplish this, due to delete by query. The other options are to either auto-commit sans a user commit before a delete, or just say we don't support that use case when using that UpdateHandler. None of it is very pretty.

        Another option is to do things with an UpdateProcessor. This is the most elegant solution really, but it requires putting big,coarse syncs around the more precise syncs in DirectUpdateHandler2. That may not be a huge deal, I am not sure. The previous two options allow you to maintain similar syncs as to what is already there. Beyond that, the UpdateProcessor approach still has the delete by query issues.

        Maybe we just do overwrite dupe for now? It has none of these issues. I am open to whatever path you guys want. The other use cases do have their place - we will just have to compromise some to get there. Or maybe there are other suggestions?

        Another point that was brought up is whether or not to delete any docs that match the update docs uniqueField id term, but not its similarity/update term. At the current moment, IMO, we shouldn't. You are choosing to use the updateTerm to do updates rather then the unique term. This allows you to have duplicate signatures but also uniqueField Ids for other operations (say delete). Also, if you already have a unique field that your using, it may be desirable to do dupe detection with a different field. There is always the option of setting the signature field to the uniqueField term. In the end, your call, I'll add it if we want it.

        As far as search time dupe collapsing, I think I could see a search component that takes the page numbers to collapse (start, end) and does dupe elimination on that range at query time. It wouldn't be very fast, and I'm not sure how useful page at a time collapsing is, but it would be fairly easy to do. Not sure that it fits into this issue, but certainly could share some of its classes.

        Show
        Mark Miller added a comment - - edited I find the pluggable replace/prevent/append policy idea appealing, but I have not yet found a great way to plug it into the UpdateHandler. Any approach other than sub-classing DirectUpdateHandler2 appears to lead to tying an IndexWriter to UpdateHandler. There is a connection now, UpdateHandler has a method to create a main IndexWriter, but further tying seems wrong without a stronger reason. That point is arguable, but in the end, sub-classing results in simpler code in any case. The trade off is that now you have a PreventDupesDirectUpdateHandler that extends DirectUpdateHandler2. This would have to be used in combination with the SignatureUpdateProcessor if you want to prevent dupes from entering the index. Other use cases (other than overwriting) would require another UpdateHandler. Less than ideal in both cases (subclassing, pluggable interface/class). Both approaches lead to less than ideal solutions beyond that as well . Because many docs that have been added to Solr might not yet be visible to an IndexReader, you have to keep a pending commit set of docs to check against. This list should be resilient against AddDoc, DelDocWquery, AddDoc, Commit. You'd essentially have to keep a mini index around to search against to accomplish this, due to delete by query. The other options are to either auto-commit sans a user commit before a delete, or just say we don't support that use case when using that UpdateHandler. None of it is very pretty. Another option is to do things with an UpdateProcessor. This is the most elegant solution really, but it requires putting big,coarse syncs around the more precise syncs in DirectUpdateHandler2. That may not be a huge deal, I am not sure. The previous two options allow you to maintain similar syncs as to what is already there. Beyond that, the UpdateProcessor approach still has the delete by query issues. Maybe we just do overwrite dupe for now? It has none of these issues. I am open to whatever path you guys want. The other use cases do have their place - we will just have to compromise some to get there. Or maybe there are other suggestions? Another point that was brought up is whether or not to delete any docs that match the update docs uniqueField id term, but not its similarity/update term. At the current moment, IMO, we shouldn't. You are choosing to use the updateTerm to do updates rather then the unique term. This allows you to have duplicate signatures but also uniqueField Ids for other operations (say delete). Also, if you already have a unique field that your using, it may be desirable to do dupe detection with a different field. There is always the option of setting the signature field to the uniqueField term. In the end, your call, I'll add it if we want it. As far as search time dupe collapsing, I think I could see a search component that takes the page numbers to collapse (start, end) and does dupe elimination on that range at query time. It wouldn't be very fast, and I'm not sure how useful page at a time collapsing is, but it would be fairly easy to do. Not sure that it fits into this issue, but certainly could share some of its classes.
        Hide
        Mark Miller added a comment -

        Whoops...thought I had posted this. Heres another draft I did a couple weeks ago.

        Fixes most of the comments brought up. There will need to be another draft - leaving the schema test file in for now as its still helpful to me.

        No prevent/append etc options here due to all the issues I mentioned, but I do have unposted code experimenting in the different directions if we want to try to go there anyway.

        Show
        Mark Miller added a comment - Whoops...thought I had posted this. Heres another draft I did a couple weeks ago. Fixes most of the comments brought up. There will need to be another draft - leaving the schema test file in for now as its still helpful to me. No prevent/append etc options here due to all the issues I mentioned, but I do have unposted code experimenting in the different directions if we want to try to go there anyway.
        Hide
        Yonik Seeley added a comment -

        Maybe we just do overwrite dupe for now?

        +1, as long as we don't do anything to preclude the other stuff - we just need to leave "room" in the config XML and the update API such that we don't have to break the back compatibility of this patch if/when future features are implemented.

        Another point that was brought up is whether or not to delete any docs that match the update docs uniqueField id term, but not its similarity/update term. You are choosing to use the updateTerm to do updates rather then the unique term.

        It seems like uniqueField should normally enforce uniqueness, regardless of what this component does. If one wants duplicate ids, then it seems like a different field should be used for that (other than the uniqueKey field). If one wants to delete only on the hash field, then they can make the hash field the id field.

        Show
        Yonik Seeley added a comment - Maybe we just do overwrite dupe for now? +1, as long as we don't do anything to preclude the other stuff - we just need to leave "room" in the config XML and the update API such that we don't have to break the back compatibility of this patch if/when future features are implemented. Another point that was brought up is whether or not to delete any docs that match the update docs uniqueField id term, but not its similarity/update term. You are choosing to use the updateTerm to do updates rather then the unique term. It seems like uniqueField should normally enforce uniqueness, regardless of what this component does. If one wants duplicate ids, then it seems like a different field should be used for that (other than the uniqueKey field). If one wants to delete only on the hash field, then they can make the hash field the id field.
        Hide
        Mark Miller added a comment -

        Ok. I cant muster up much of a defense for leaving it out I suppose.

        I'll polish off a final patch.

        • Mark

        On Nov 4, 2008, at 3:06 PM, "Yonik Seeley (JIRA)" <jira@apache.org>

        Show
        Mark Miller added a comment - Ok. I cant muster up much of a defense for leaving it out I suppose. I'll polish off a final patch. Mark On Nov 4, 2008, at 3:06 PM, "Yonik Seeley (JIRA)" <jira@apache.org>
        Hide
        Hoss Man added a comment -

        It seems like uniqueField should normally enforce uniqueness, regardless of what this component does.

        agreed.

        Whilei can imagine use cases for adding a signature field that is independent from the uniqueKey field (ie: query time duplicate pruning/collapsing) I'm having a really hard time thinking of any use cases where someone would need special deletion logic on a (non uniqueKey) signature field. if you want docs with identical signatures deleted, why wouldn't you make that the uniqueKey field? ... if you have both, you could really confuse the hell out of someone who doesn't understand why adding one doc deleted a different doc with a completely different uniqueKey.

        Show
        Hoss Man added a comment - It seems like uniqueField should normally enforce uniqueness, regardless of what this component does. agreed. Whilei can imagine use cases for adding a signature field that is independent from the uniqueKey field (ie: query time duplicate pruning/collapsing) I'm having a really hard time thinking of any use cases where someone would need special deletion logic on a (non uniqueKey) signature field. if you want docs with identical signatures deleted, why wouldn't you make that the uniqueKey field? ... if you have both, you could really confuse the hell out of someone who doesn't understand why adding one doc deleted a different doc with a completely different uniqueKey.
        Hide
        Mark Miller added a comment -

        This ensures the id field stays unique. Are there any other issues that need to be addressed? If not I'll work up one last patch removing the test config file and possibly adding a couple more tests.

        Show
        Mark Miller added a comment - This ensures the id field stays unique. Are there any other issues that need to be addressed? If not I'll work up one last patch removing the test config file and possibly adding a couple more tests.
        Hide
        Mark Miller added a comment -

        There's probably no need for a separate test solrconfig-deduplicate.xml if all it adds is an update processor. Tests could just explicitly specify the update handler on updates.

        Now that I look to fix this, I am not understanding - I don't need to change the update handler, I need to change the update chain...I am not seeing how that can be done dynamically...is it possible? If not I think I need the config xml.

        Show
        Mark Miller added a comment - There's probably no need for a separate test solrconfig-deduplicate.xml if all it adds is an update processor. Tests could just explicitly specify the update handler on updates. Now that I look to fix this, I am not understanding - I don't need to change the update handler, I need to change the update chain...I am not seeing how that can be done dynamically...is it possible? If not I think I need the config xml.
        Hide
        Mark Miller added a comment -

        I'm going to put up another patch for this soon. I'd like to have some getters on the factory for manual exploration of the settings.

        Show
        Mark Miller added a comment - I'm going to put up another patch for this soon. I'd like to have some getters on the factory for manual exploration of the settings.
        Hide
        Yonik Seeley added a comment -

        Now that I look to fix this, I am not understanding - I don't need to change the update handler, I need to change the update chain...I am not seeing how that can be done dynamically...is it possible?

        Yes, you can dynamically change an update processor: update.processor=hash

        Show
        Yonik Seeley added a comment - Now that I look to fix this, I am not understanding - I don't need to change the update handler, I need to change the update chain...I am not seeing how that can be done dynamically...is it possible? Yes, you can dynamically change an update processor: update.processor=hash
        Hide
        Mark Miller added a comment - - edited

        Okay, I see. I was too intent on changing the current chain - problem indeed goes away by just plugging in an entirely new chain.

        edit

        So it looks like I can change the UpdateRequestProcessorChain.chain because its package protected. So I can make a new chain in the test rather than add a config file - but I lose the testing of the parsing of the config settings for the SignatureUpdateProcessor. I suppose I am not so attached them, but I kind of like the idea of some of that hitting a test...no?

        Show
        Mark Miller added a comment - - edited Okay, I see. I was too intent on changing the current chain - problem indeed goes away by just plugging in an entirely new chain. edit So it looks like I can change the UpdateRequestProcessorChain.chain because its package protected. So I can make a new chain in the test rather than add a config file - but I lose the testing of the parsing of the config settings for the SignatureUpdateProcessor. I suppose I am not so attached them, but I kind of like the idea of some of that hitting a test...no?
        Hide
        Ryan McKinley added a comment -

        I'm not sure how you have the test set up, so I could be way off base.

        You could subclass SearchHandler and set the protected List<SearchComponent> components directly...

        Show
        Ryan McKinley added a comment - I'm not sure how you have the test set up, so I could be way off base. You could subclass SearchHandler and set the protected List<SearchComponent> components directly...
        Hide
        Yonik Seeley added a comment -

        Why not plug in an entirely new chain? That is one of the way it would be done for users of this component, right?

        <updateRequestProcessorChain name="hash"> [...]

        And then in the test send in update.processor=hash as a parameter.

        Show
        Yonik Seeley added a comment - Why not plug in an entirely new chain? That is one of the way it would be done for users of this component, right? <updateRequestProcessorChain name="hash"> [...] And then in the test send in update.processor=hash as a parameter.
        Hide
        Mark Miller added a comment -

        This patch fixes some oddness with how the enabled setting worked and removes the test solrconfig.xml file that was added.

        Wiki has been updated as well.

        • Mark
        Show
        Mark Miller added a comment - This patch fixes some oddness with how the enabled setting worked and removes the test solrconfig.xml file that was added. Wiki has been updated as well. Mark
        Hide
        Yonik Seeley added a comment -
        • fixed so that all values of a multi-valued field are included in the hash
        • changed so that no string additions are done for performance
        • moved HEX_CHARS to StrUtils
        • changed "fields" to be a comma separated list (per the wiki documentation... this may be more consistent if we allow this to be specified as a request parameter later, but it's subjective for sure. we could always add support for both arrays and comma separated lists).
        • changed the hashcode generation to work with any sized hash (was previously hardcoded to 16 bytes)
        • added lookup3ycs http://yonik.wordpress.com/2008/06/14/lookup3ycs-a-standard-high-performance-string-hash/ lookup3ycs can do hashes directly on strings (no need to convert to bytes first). I used the 64 bit variant, which is more than enough to prevent false collisions, and it resulted in a 27% speedup in total indexing time (after removing other cruft from the schema such as copyFields and default values).
        • tested with 10M documents to verify that no collisions occur with both MD5 and lookup3
        • Committed! Thanks Mark! And thanks to everyone else for the great feedback.
        Show
        Yonik Seeley added a comment - fixed so that all values of a multi-valued field are included in the hash changed so that no string additions are done for performance moved HEX_CHARS to StrUtils changed "fields" to be a comma separated list (per the wiki documentation... this may be more consistent if we allow this to be specified as a request parameter later, but it's subjective for sure. we could always add support for both arrays and comma separated lists). changed the hashcode generation to work with any sized hash (was previously hardcoded to 16 bytes) added lookup3ycs http://yonik.wordpress.com/2008/06/14/lookup3ycs-a-standard-high-performance-string-hash/ lookup3ycs can do hashes directly on strings (no need to convert to bytes first). I used the 64 bit variant, which is more than enough to prevent false collisions, and it resulted in a 27% speedup in total indexing time (after removing other cruft from the schema such as copyFields and default values). tested with 10M documents to verify that no collisions occur with both MD5 and lookup3 Committed! Thanks Mark! And thanks to everyone else for the great feedback.
        Hide
        Lance Norskog added a comment -

        I came into Solr with no search experience and it was quite a learning curve. The modular design of the configuration really helped, and we should maintain that modularity. There are two different designs: the design of the configuration and the design of the implementation. This comment only addresses the design of the configuration files.

        The patch as committed moves the specification of one field out of schema.xml file to another file. This breaks the modularity of the configurations. I suggest that the files should look like this:

        schema.xml:
        <field name="signatureField" type="signatureField" indexed="true" stored="false" signature="solr.TextProfileSignature" fields="product_name, model_t, *_s" />

        solrconfig.xml:
        <updateRequestProcessorChain name="dedupe">
        <processor class="org.apache.solr.update.processor.SignatureUpdateProcessorFactory">
        <string name="signatureField">signatureField</string>
        <bool name="enabled">false</bool>
        <bool name="overwriteDupes">true</bool>
        </processor>

        <processor class="solr.RunUpdateProcessorFactory" />
        </updateRequestProcessorChain>

        That is, the design of the signature field should go in schema.xml, and each updateRequest section should only describe how it is used with that section's declared name. Also, there should be no default field, since every field in the schema should be described in schema.xml.

        Show
        Lance Norskog added a comment - I came into Solr with no search experience and it was quite a learning curve. The modular design of the configuration really helped, and we should maintain that modularity. There are two different designs: the design of the configuration and the design of the implementation. This comment only addresses the design of the configuration files. The patch as committed moves the specification of one field out of schema.xml file to another file. This breaks the modularity of the configurations. I suggest that the files should look like this: schema.xml: <field name="signatureField" type="signatureField" indexed="true" stored="false" signature="solr.TextProfileSignature" fields="product_name, model_t, *_s" /> solrconfig.xml: <updateRequestProcessorChain name="dedupe"> <processor class="org.apache.solr.update.processor.SignatureUpdateProcessorFactory"> <string name="signatureField">signatureField</string> <bool name="enabled">false</bool> <bool name="overwriteDupes">true</bool> </processor> <processor class="solr.RunUpdateProcessorFactory" /> </updateRequestProcessorChain> That is, the design of the signature field should go in schema.xml, and each updateRequest section should only describe how it is used with that section's declared name. Also, there should be no default field, since every field in the schema should be described in schema.xml.
        Hide
        Shalin Shekhar Mangar added a comment -

        <field name="signatureField" type="signatureField" indexed="true" stored="false" signature="solr.TextProfileSignature" fields="product_name, model_t, *_s" />

        I don't think signatureField is a separate type. It is just a string, right?

        The patch as committed moves the specification of one field out of schema.xml file to another file.

        That is, the design of the signature field should go in schema.xml, and each updateRequest section should only describe how it is used with that section's declared name. Also, there should be no default field, since every field in the schema should be described in schema.xml.

        The design of the signature field goes into schema.xml right now too. The wiki clearly states the following about signatureField:

        The name of the field used to hold the fingerprint/signature. Be sure the field is defined in schema.xml. 
        

        <field name="signatureField" type="signatureField" indexed="true" stored="false" signature="solr.TextProfileSignature" fields="product_name, model_t, *_s" />

        I don't agree with the above. The method of computing the contents of the field should not be part of schema.xml. I do not understand your concern, maybe because I'm not very familiar with this feature.

        Show
        Shalin Shekhar Mangar added a comment - <field name="signatureField" type="signatureField" indexed="true" stored="false" signature="solr.TextProfileSignature" fields="product_name, model_t, *_s" /> I don't think signatureField is a separate type. It is just a string, right? The patch as committed moves the specification of one field out of schema.xml file to another file. That is, the design of the signature field should go in schema.xml, and each updateRequest section should only describe how it is used with that section's declared name. Also, there should be no default field, since every field in the schema should be described in schema.xml. The design of the signature field goes into schema.xml right now too. The wiki clearly states the following about signatureField: The name of the field used to hold the fingerprint/signature. Be sure the field is defined in schema.xml. <field name="signatureField" type="signatureField" indexed="true" stored="false" signature="solr.TextProfileSignature" fields="product_name, model_t, *_s" /> I don't agree with the above. The method of computing the contents of the field should not be part of schema.xml. I do not understand your concern, maybe because I'm not very familiar with this feature.
        Hide
        Hoss Man added a comment -

        The separation of concerns between schema.xml and solrconfig.xml has always been...

        • schema.xml: what is the data, what is it's nature, what are it's intrinsic properties?
        • solrconfig.xml: what can people do with your data, how can they use it?

        fields, fieldTypes, analyzers, copyFields go in the schema.xml because they are (in theory) intrinsic to the nature of your data regardless of where a given document comes from:

        • documents should only have one author
        • categoryName should always be tokenized in a particular way
        • prices need to sort numericly not lexigraphicallyy
        • any text indexed in the shortSummary field shoudl also be indexed in the searchableAbstract field
        • etc...

        request handlers that dictate how people can use the data are specified in solrconfig.xml – when searching data request handlers (which may leverage search componets) dictate what a user is allowed to get/see; when modifying an index request handlers (which may leverage update processors) dictate what data is allowed to come from various sources and in what formats.

        In short: as far as document indexing goes, the options configured in solrconfig.xml specify how to "build up" a Document object from user input, while the options in schema.xml specify how to "tear it down" into it's individual terms and values for indexing.

        With the near duplicate detection code, it is the schema's job to say which fields can exist in the input documents, including a signature field – but it is the solrconfig's job to decide how to compute that signature field ... after all: the computation might be different depending on the source of the data (ie: different processor chains could be configured for different request handlers)

        Show
        Hoss Man added a comment - The separation of concerns between schema.xml and solrconfig.xml has always been... schema.xml: what is the data, what is it's nature, what are it's intrinsic properties? solrconfig.xml: what can people do with your data, how can they use it? fields, fieldTypes, analyzers, copyFields go in the schema.xml because they are (in theory) intrinsic to the nature of your data regardless of where a given document comes from: documents should only have one author categoryName should always be tokenized in a particular way prices need to sort numericly not lexigraphicallyy any text indexed in the shortSummary field shoudl also be indexed in the searchableAbstract field etc... request handlers that dictate how people can use the data are specified in solrconfig.xml – when searching data request handlers (which may leverage search componets) dictate what a user is allowed to get/see; when modifying an index request handlers (which may leverage update processors) dictate what data is allowed to come from various sources and in what formats. In short: as far as document indexing goes, the options configured in solrconfig.xml specify how to "build up" a Document object from user input, while the options in schema.xml specify how to "tear it down" into it's individual terms and values for indexing. With the near duplicate detection code, it is the schema's job to say which fields can exist in the input documents, including a signature field – but it is the solrconfig's job to decide how to compute that signature field ... after all: the computation might be different depending on the source of the data (ie: different processor chains could be configured for different request handlers)
        Hide
        Grant Ingersoll added a comment -

        Bulk close for Solr 1.4

        Show
        Grant Ingersoll added a comment - Bulk close for Solr 1.4
        Hide
        Thomas Heigl added a comment -

        Hello,

        For my current project I need to implement an index-time mechanism to detect (near) duplicate documents. The TextProfileSignature available out-of-the-box (http://wiki.apache.org/solr/Deduplication) seems alright but does not use global collection statistics in deciding which terms will be used for calculating the signature.
        Most state-of-the-art hash-based duplication detection algorithms make use of this information to improve precision and recall (e.g. http://portal.acm.org/citation.cfm?id=506311&dl=GUIDE&coll=GUIDE&CFID=83187370&CFTOKEN=47052122)

        Is it possible to access collection statistics - especially IDF values for all non-discarded terms in the current document - from within an implementation of the Signature class?

        Kind regards,

        Thomas

        Show
        Thomas Heigl added a comment - Hello, For my current project I need to implement an index-time mechanism to detect (near) duplicate documents. The TextProfileSignature available out-of-the-box ( http://wiki.apache.org/solr/Deduplication ) seems alright but does not use global collection statistics in deciding which terms will be used for calculating the signature. Most state-of-the-art hash-based duplication detection algorithms make use of this information to improve precision and recall (e.g. http://portal.acm.org/citation.cfm?id=506311&dl=GUIDE&coll=GUIDE&CFID=83187370&CFTOKEN=47052122 ) Is it possible to access collection statistics - especially IDF values for all non-discarded terms in the current document - from within an implementation of the Signature class? Kind regards, Thomas
        Hide
        Andrzej Bialecki added a comment -

        This issue is closed - please use the mailing lists for discussions.

        Show
        Andrzej Bialecki added a comment - This issue is closed - please use the mailing lists for discussions.

          People

          • Assignee:
            Yonik Seeley
            Reporter:
            Mark Miller
          • Votes:
            2 Vote for this issue
            Watchers:
            4 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development