Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 6.2, 7.0
    • Component/s: modules/analysis
    • Labels:
      None
    • Lucene Fields:
      New

      Description

      I'm planning to implement LSH, which supports queries like this:

      Find similar documents that have a similarity score of 0.8 or higher with a given document. The similarity measure can be cosine, Jaccard, Euclidean, etc.

      For example, given the following corpus:

      1. Solr is an open source search engine based on Lucene
      2. Solr is an open source enterprise search engine based on Lucene
      3. Solr is an popular open source enterprise search engine based on Lucene
      4. Apache Lucene is a high-performance, full-featured text search engine library written entirely in Java

      We want to find documents that have a Jaccard score of 0.6 or higher with this doc:

      Solr is an open source search engine

      It will return only docs 1, 2, and 3 (MoreLikeThis would also return doc 4).
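
      For reference, the plain set-based Jaccard measure being thresholded here can be computed like this (a minimal illustrative sketch over whitespace tokens, not part of the patch; LSH only approximates this):

      import java.util.Arrays;
      import java.util.HashSet;
      import java.util.Set;

      public class JaccardExample {

        // Jaccard(A, B) = |A ∩ B| / |A ∪ B| over the two token sets.
        static double jaccard(Set<String> a, Set<String> b) {
          Set<String> intersection = new HashSet<>(a);
          intersection.retainAll(b);
          Set<String> union = new HashSet<>(a);
          union.addAll(b);
          return union.isEmpty() ? 0.0 : (double) intersection.size() / union.size();
        }

        static Set<String> tokens(String text) {
          return new HashSet<>(Arrays.asList(text.toLowerCase().split("\\s+")));
        }

        public static void main(String[] args) {
          Set<String> query = tokens("Solr is an open source search engine");
          // Doc 1 shares all 7 query tokens out of 10 in the union -> 0.7, above the 0.6 threshold.
          System.out.println(jaccard(query, tokens("Solr is an open source search engine based on Lucene")));
          // Doc 4 shares only 3 tokens out of 18 in the union -> ~0.17, well below the threshold.
          System.out.println(jaccard(query, tokens(
              "Apache Lucene is a high-performance, full-featured text search engine library written entirely in Java")));
        }
      }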

      1. LUCENE-6968.4.patch
        10 kB
        Tommaso Teofili
      2. LUCENE-6968.5.patch
        31 kB
        Andy Hind
      3. LUCENE-6968.6.patch
        1.0 kB
        Tommaso Teofili
      4. LUCENE-6968.patch
        37 kB
        Andy Hind
      5. LUCENE-6968.patch
        22 kB
        Cao Manh Dat
      6. LUCENE-6968.patch
        12 kB
        Cao Manh Dat

        Issue Links

          Activity

          caomanhdat Cao Manh Dat added a comment -

          Initial patch. Can find similar documents based on Jaccard similarity.

          caomanhdat Cao Manh Dat added a comment - - edited

          This new patch includes:

          • Change IntHash to MultiplyShiftHash (faster universal hashing; a sketch of multiply-shift hashing follows below)
          • CosineHashFilter, which supports finding similar documents based on cosine similarity.
          • LSH#createSlowQuery + LSHSimilarity, which support finding similar documents under an arbitrary similarity measure and return similarity scores.

            doc1 : "Solr is an open source search engine based on Lucene"
            doc2 : "Solr is an open source enterprise search engine based on Lucene"
            doc3 : "Solr is an open source enterprise search engine based on Lucene"

            When we search for documents similar to doc1, the result will contain:

            doc1 : score = 100
            doc2 : score = 90
            doc3 : score = 87
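
          For context, multiply-shift universal hashing (presumably the basis of a MultiplyShiftHash class; this is an illustrative sketch, not code from the patch) maps a 64-bit key to an m-bit hash with one multiplication and one shift:

          import java.util.SplittableRandom;

          public class MultiplyShift {

            // Dietzfelbinger-style multiply-shift: h_a(x) = (a * x) >>> (64 - m),
            // where a is a random odd 64-bit multiplier and m is the output width in bits.
            static int hash(long x, long oddMultiplier, int m) {
              return (int) ((oddMultiplier * x) >>> (64 - m));
            }

            public static void main(String[] args) {
              SplittableRandom rnd = new SplittableRandom(42);
              long a = rnd.nextLong() | 1L;   // force the multiplier to be odd
              int m = 20;                     // 20-bit hash values
              System.out.println(hash(123456789L, a, m));
              System.out.println(hash(123456790L, a, m));
            }
          }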

          joel.bernstein Joel Bernstein added a comment -

          Cao Manh Dat, interesting ticket!

          Can you describe some of the potential uses for LSH? For example, will this be helpful in approximating K nearest neighbors?

          caomanhdat Cao Manh Dat added a comment - - edited

          Yes, it's kind of like finding the K nearest neighbors, but there is a difference here:

          • In K nearest neighbors, we use K as the parameter, which decides how many nearest neighbors we want to retrieve.
          • In LSH we use a radius as the parameter; the radius bounds how far other docs may be from the center doc. (LSH is also far faster than K nearest neighbors.)

          In both cases, the results will be ranked by distance to the center doc.

          teofili Tommaso Teofili added a comment -

          Since this has also been mentioned in SOLR-7739, maybe the LSH filter can be used to create a Lucene nearest neighbour Classifier which uses such a filter (instead of the existing k nearest neighbour one).

          teofili Tommaso Teofili added a comment -

          I was having a second look at this patch: why is LSHSimilarity returning 1 in all its methods? Is that intended, or does it mean it still needs to be implemented?

          teofili Tommaso Teofili added a comment -

          Cao Manh Dat, I'd like to commit this patch shortly. My only concern is around the similarity implementation: having all methods return 1 doesn't seem correct to me, or am I missing something?

          joel.bernstein Joel Bernstein added a comment -

          Hi, I believe there will be some work forthcoming from Alfresco on this ticket. We may want to hold off until those patches go up. Hopefully I will have an update on this soon.

          Tommaso Teofili, if you feel strongly though that this is ready to commit, then go for it. We can always make adjustments after the commit as well.

          andyhind Andy Hind added a comment - - edited

          Hi

          It would be quite common to use min hashing after shingling. At this point the number of possible word combinations vs the size of the hash is important. With shingles of 5 words from a vocabulary of 100,000 that is 10^25 combinations. Some naive processing of the ~500k Enron emails (splitting on whitespace, case folding and 5-word shingles) gives ~52M combinations. So a long (64-bit) hash, with ~1.8e19 possible values, would be better. I have not yet looked at a larger corpus.

          The LSH query is neat. However, the logic can give banding where the last band is uneven. In the patch I think the last band would be dropped unless bands * rows per band = number of hashes (see the band-partitioning sketch below).

          The underlying state of the source filter may also be lost (if using shingling).

          I do not believe the similarity is required at all. I think you can get Jaccard distance using constant-score queries and disabling coordination on the boolean query.

          I went for 128-bit hashes, or a 32-bit hash identifier + 96-bit hash, with a bit more flexibility allowing a minimum set of hash values for a bunch of hashes. There is clearly some trade-off between speed of hashing and over-representing short documents. The minimum set may be a solution to this. I think there is some interesting research there.

          I will add my patch, inspired by the original, and apologise for the mixed formatting in advance.
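
          On the uneven-band point, one way to band a fingerprint without silently dropping the tail is to let the last band be shorter (an illustrative sketch of the banding arithmetic only, not code from either patch):

          import java.util.ArrayList;
          import java.util.Arrays;
          import java.util.List;

          public class Banding {

            // Split `hashes` into bands of `rowsPerBand`; the last band keeps the leftover hashes
            // instead of being dropped when hashes.length is not a multiple of rowsPerBand.
            static List<long[]> bands(long[] hashes, int rowsPerBand) {
              List<long[]> bands = new ArrayList<>();
              for (int start = 0; start < hashes.length; start += rowsPerBand) {
                int end = Math.min(start + rowsPerBand, hashes.length);
                bands.add(Arrays.copyOfRange(hashes, start, end));
              }
              return bands;
            }

            public static void main(String[] args) {
              // 10 hashes with 4 rows per band -> bands of sizes 4, 4 and 2 (nothing dropped).
              for (long[] band : bands(new long[] {1, 2, 3, 4, 5, 6, 7, 8, 9, 10}, 4)) {
                System.out.println(Arrays.toString(band));
              }
            }
          }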

          caomanhdat Cao Manh Dat added a comment - - edited

          What a wonderful patch! The code is optimized, and I'm sure the index will be much smaller!

          But the patch keeps some of the lowest values for each position; does that affect the formula

           Pr(h(s1) = h(s2)) = Jaccard(s1,s2) 
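
          That identity can be checked empirically: over many independent hash functions, the fraction of functions whose minimum agrees on two sets approaches their Jaccard similarity. A minimal simulation (not code from the patch), using a fresh random value per element as the "hash function" for each trial:

          import java.util.HashMap;
          import java.util.Map;
          import java.util.Set;
          import java.util.SplittableRandom;

          public class MinHashProperty {

            public static void main(String[] args) {
              Set<Integer> a = Set.of(1, 2, 5, 6, 7);
              Set<Integer> b = Set.of(3, 4, 5, 6, 7);   // true Jaccard = 3 / 7 ≈ 0.43
              SplittableRandom rnd = new SplittableRandom(7);
              int trials = 100_000, agree = 0;
              for (int t = 0; t < trials; t++) {
                // One "hash function" per trial: a fresh random value for every element.
                Map<Integer, Long> h = new HashMap<>();
                for (int x : a) h.put(x, rnd.nextLong());
                for (int x : b) h.computeIfAbsent(x, k -> rnd.nextLong());
                long minA = a.stream().mapToLong(h::get).min().getAsLong();
                long minB = b.stream().mapToLong(h::get).min().getAsLong();
                if (minA == minB) agree++;
              }
              // The fraction of agreements estimates Jaccard(A, B); prints roughly 0.43.
              System.out.println((double) agree / trials);
            }
          }
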
          andyhind Andy Hind added a comment - - edited

          The argument here says it is pretty much the same.

          https://en.wikipedia.org/wiki/MinHash
          

          The plan was to offer both options.

          With respect to banding and finding docs related to some start document, the number of hashes may depend on the start document.

          Let's start with 5-word shingles, one hash function, and keep the minimum 100 hash values. For a five-word document we get one hash. For a 100-word doc where all the shingles/words are the same we get one hash. If all the shingles are different we get 96 hashes.

          If instead we have 100 different hash functions and keep the lowest value of each, all the above cases end up with 100 hashes.

          So back to banding. With minimum sets, you need to look and see how many hashes you really got and then do the banding. Comparing a small document/snippet (where we get 10 hashes in the fingerprint) with a much larger document (where we get 100 hashes) is an interesting case to consider. Starting with the small document there are fewer bits to match in the generated query. With 100 hashes from the small document I think you end up in roughly the same place, except for small snippets. Any given band is more likely to have the same shingle hashed in different ways.

          With a 100-hash fingerprint, sampling for 100 words is great but not so great for 100,000 words. With a minimum set we have the option to generate a fingerprint related to the document length and other features.

          There is also an argument for a winnowing approach.

          caomanhdat Cao Manh Dat added a comment - - edited

          Thanks for the link. I totally agree that keeping the lowest values for a single hash function would be better.

          But the wiki doc points out that the estimator for the "variant with a single hash function" is not the same as the estimator for the "variant with many hash functions", so the generated query must be different for each case.

          For example, in the case where we use a single hash function and keep the k lowest values (a sketch follows below):

          • We have doc A = [1, 2, 5, 6, 7], doc B = [3, 4, 5, 6, 7]
          • So, with k = 5, jaccard(A,B) = size( hk(A U B) ∩ hk(A) ∩ hk(B) ) / k = size( [5] ) / k = 0.2
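
          A small sketch of that single-hash, bottom-k estimator, using the values themselves as stand-in hashes so the numbers match the example above (illustrative only, not code from the patch):

          import java.util.HashSet;
          import java.util.Set;
          import java.util.TreeSet;
          import java.util.stream.Collectors;

          public class BottomKJaccard {

            // Keep the k smallest "hash" values of a set (here the values stand in for hashes).
            static Set<Integer> bottomK(Set<Integer> values, int k) {
              return new TreeSet<>(values).stream().limit(k).collect(Collectors.toSet());
            }

            // Single-hash estimator: size( h_k(A ∪ B) ∩ h_k(A) ∩ h_k(B) ) / k
            static double estimate(Set<Integer> a, Set<Integer> b, int k) {
              Set<Integer> union = new TreeSet<>(a);
              union.addAll(b);
              Set<Integer> x = new HashSet<>(bottomK(union, k));
              x.retainAll(bottomK(a, k));
              x.retainAll(bottomK(b, k));
              return (double) x.size() / k;
            }

            public static void main(String[] args) {
              Set<Integer> a = Set.of(1, 2, 5, 6, 7);
              Set<Integer> b = Set.of(3, 4, 5, 6, 7);
              // h_5(A ∪ B) = {1,2,3,4,5}; intersecting with h_5(A) and h_5(B) leaves {5} -> 1/5 = 0.2
              System.out.println(estimate(a, b, 5));
            }
          }
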
          andyhind Andy Hind added a comment - - edited

          This comes down to "what is a good estimate of |A U B|" and do we need it for the use case.

          For query, the main use case is finding documents like one source document. So we are comparing Doc A with all other documents. What we need is a measure that is fair for A -> B, A -> C. We probably do not care about B -> C. If we take the fingerprint from A and just OR the bits together into a big query we have a consistent measure of similarity of A with any other document. This particular measure is biased. For a start Sim(A, B) is not equal to Sim(B, A). But for this use case that may not matter. This measure contains both inclusion and duplication which may be a good thing. It is also pretty intuitive what it means. This is (A ∩ B)/|A|.

          If we want Sim(A, B) = Sim(B, A) then we need some consistent measure/sample of |A U B| to normalise our measure/estimate of A ∩ B. This could be (|A| + |B| - |A ∩ B|), or some similar estimate. We could use the size of the fingerprint sets. We could keep the full ordered set of hashes and have extra statistics like the total number of hashes and total number of unique hashes.

          For two short documents, where there are fewer fingerprints than the maximum, we have the full sets.
          For two larger docs we have an estimate of these based on the min hash sets. You can argue "min of many hashes" is a random sample with replacement and "min set of one hash" is a random sample without replacement, if your hash is good. If the sample is small compared with the set of all hashes the arguments converge.

          If we were doing arbitrary comparisons between any pair of documents then we would have to use an unbiased estimator. Finding candidate pairs, moving onto clustering, ...
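
          A sketch of that biased-but-simple query shape in Lucene terms (the field name and fingerprint values are illustrative; on 6.x one would also disable coord on the builder, as mentioned earlier, while in 7.0 coord is gone anyway):

          import org.apache.lucene.index.Term;
          import org.apache.lucene.search.BooleanClause;
          import org.apache.lucene.search.BooleanQuery;
          import org.apache.lucene.search.ConstantScoreQuery;
          import org.apache.lucene.search.Query;
          import org.apache.lucene.search.TermQuery;

          public class FingerprintQuery {

            // OR together one constant-score clause per fingerprint hash of the source document A.
            // Each matching clause contributes 1 to the score, so score(B) ≈ |A ∩ B|, and dividing
            // by the fingerprint size gives the |A ∩ B| / |A| measure described above.
            static Query fingerprintOr(String field, String[] fingerprintHashes) {
              BooleanQuery.Builder builder = new BooleanQuery.Builder();
              for (String hash : fingerprintHashes) {
                builder.add(new ConstantScoreQuery(new TermQuery(new Term(field, hash))),
                    BooleanClause.Occur.SHOULD);
              }
              return builder.build();
            }
          }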

          caomanhdat Cao Manh Dat added a comment -

          Thanks, that makes sense!

          teofili Tommaso Teofili added a comment -

          attaching a slightly modified version of the last patch:

          • added service loader binding for MinHashFilterFactory
          • added IntelliJ required dependencies
          • minor fixes to javadoc (and code style to be consistent with rest of the codebase)

          I've noticed, though, that the filter doesn't perfectly align the end offset attribute (it can end up beyond the input length); in fact, if I run all tests, the TestFactories one fails with the following:

          Suite: org.apache.lucene.analysis.core.TestFactories
             [junit4]   2> TEST FAIL: useCharFilter=true text='uuzfmo'
             [junit4]   2> NOTE: reproduce with: ant test  -Dtestcase=TestFactories -Dtests.method=test -Dtests.seed=9CF9D39BDAB31A80 -Dtests.slow=true -Dtests.locale=sv -Dtests.timezone=Asia/Choibalsan -Dtests.asserts=true -Dtests.file.encoding=US-ASCII
             [junit4] FAILURE 13.5s J3 | TestFactories.test <<<
             [junit4]    > Throwable #1: java.lang.AssertionError: endOffset must be <= finalOffset: got endOffset=7 vs finalOffset=6
             [junit4]    > 	at __randomizedtesting.SeedInfo.seed([9CF9D39BDAB31A80:14ADEC41744F7778]:0)
             [junit4]    > 	at org.apache.lucene.analysis.BaseTokenStreamTestCase.assertTokenStreamContents(BaseTokenStreamTestCase.java:211)
             [junit4]    > 	at org.apache.lucene.analysis.BaseTokenStreamTestCase.assertTokenStreamContents(BaseTokenStreamTestCase.java:300)
             [junit4]    > 	at org.apache.lucene.analysis.BaseTokenStreamTestCase.assertTokenStreamContents(BaseTokenStreamTestCase.java:304)
             [junit4]    > 	at org.apache.lucene.analysis.BaseTokenStreamTestCase.checkAnalysisConsistency(BaseTokenStreamTestCase.java:828)
             [junit4]    > 	at org.apache.lucene.analysis.BaseTokenStreamTestCase.checkRandomData(BaseTokenStreamTestCase.java:627)
             [junit4]    > 	at org.apache.lucene.analysis.BaseTokenStreamTestCase.checkRandomData(BaseTokenStreamTestCase.java:525)
             [junit4]    > 	at org.apache.lucene.analysis.core.TestFactories.doTestTokenFilter(TestFactories.java:105)
             [junit4]    > 	at org.apache.lucene.analysis.core.TestFactories.test(TestFactories.java:58)
             [junit4]    > 	at java.lang.Thread.run(Thread.java:745)
             [junit4]   2> NOTE: test params are: codec=Lucene60, sim=RandomSimilarity(queryNorm=true,coord=no): {}, locale=sv, timezone=Asia/Choibalsan
             [junit4]   2> NOTE: Mac OS X 10.11.3 x86_64/Oracle Corporation 1.8.0_45 (64-bit)/cpus=8,threads=1,free=197816752,total=324534272
             [junit4]   2> NOTE: All tests run in this JVM: [TestPatternCaptureGroupTokenFilter, TestSnowballPorterFilterFactory, TestBulgarianStemFilterFactory, TestFactories]
          
          rcmuir Robert Muir added a comment -

          The analysis-common library cannot have any external dependencies.

          So if this filter is going to depend on Guava, it needs to be its own analysis module, just like all other analysis modules that have third-party dependencies.

          I would try to remove the Guava dependency at all costs anyway. It causes hell for developers to depend on Guava.

          rcmuir Robert Muir added a comment -

          also, these analyzers should not be tested with queries. Instead, just use the asserts in BaseTokenStreamTestCase to check that the correct tokens are output.

          I see some solr code was copied so that things can be tested with queries. I know that solr likes to test this way, but nobody wants to debug a test failure that tests an analyzer this way.

          Just keep in mind we have a ton of analyzers, and when things go wrong (and they do), people have to debug them or even do refactorings across hundreds of them. That is why we have a consistent test class (BaseTokenStreamTestCase).
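
          The testing pattern described here looks roughly like this; LowerCaseFilter stands in so the expected tokens are concrete (for the MinHash filter the expected values would be the hash tokens it emits), and the class/test names are illustrative:

          import org.apache.lucene.analysis.BaseTokenStreamTestCase;
          import org.apache.lucene.analysis.TokenStream;
          import org.apache.lucene.analysis.Tokenizer;
          import org.apache.lucene.analysis.core.LowerCaseFilter;

          public class TestTokenAssertionExample extends BaseTokenStreamTestCase {

            // Assert directly on the tokens the filter emits rather than going through index + query.
            public void testLowerCaseTokens() throws Exception {
              Tokenizer tokenizer = whitespaceMockTokenizer("Solr IS an OPEN source SEARCH engine");
              TokenStream stream = new LowerCaseFilter(tokenizer);
              assertTokenStreamContents(stream,
                  new String[] {"solr", "is", "an", "open", "source", "search", "engine"});
            }
          }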

          teofili Tommaso Teofili added a comment -

          thanks Robert, I agree on all of your suggestions.
          The error reported by TestFactories confirms what you say; I can work on those more appropriate tests. However, I would also like to keep the "query" ones, so as to keep an eye on whether one of the use cases for this filter is supported as expected (a sort of integration test).
          Removing Guava might of course be possible (and I'd prefer to do that), although rewriting the hashcode stuff is likely to be a bit annoying.

          rcmuir Robert Muir added a comment -

          We don't need any "integration tests" or "end to end tests".

          -1 to those. Just think about it. Tokenizers only output one thing, and that is tokens. And that is what we should test.

          andyhind Andy Hind added a comment -

          I agree a pure token stream test makes sense. The only concern I have is about testing token filters chained together. Chaining shingle generation with min hashing requires that the underlying token stream has its state reset correctly for reuse. As I missed this, I added a test to cover it. Is there somewhere else in the test framework that covers this case? Some randomised chaining of filters? Perhaps chaining is more of a Solr thing.

          I would prefer to stick with a 128/96-bit hash. Link [1] below "suggests" 5-shingles become well distributed. Link [2] says up to 2/3 of all possible trigrams have been seen in 30 years of news articles. So it seems we can expect to see many of the possible 5-shingles. Some bioinformatic use cases may also require this.

          [1] http://googleresearch.blogspot.co.uk/2006/08/all-our-n-gram-are-belong-to-you.html
          [2] http://homepages.inf.ed.ac.uk/ballison/pdf/lrec_skipgrams.pdf

          I was not that keen to add Guava! However, it was already there somewhere.
          I am happy if this moves off into a separate module. I will also look to see how this dependency could be removed.

          Perhaps we should take some time to consider how to include the fingerprint length (the sum of the min set sizes over all hashes) to support an unbiased query. An unbiased query would be more difficult to build correctly. Some fingerprint/LSH query support and tests may make sense. Some other statistics may also be useful in generating faster queries that find similar documents using some threshold and a probability of meeting that threshold.

          andyhind Andy Hind added a comment -

          Yonik Seeley has murmurhash3_x64_128 here: https://github.com/yonik/java_util/blob/master/src/util/hash/MurmurHash3.java

          andyhind Andy Hind added a comment - - edited

          After a bit more digging, the single-hash, minimum-set approach can be improved.

          See:
          [1] http://jmlr.org/proceedings/papers/v32/shrivastava14.pdf
          [2] http://www.auai.org/uai2014/proceedings/individuals/225.pdf

          In summary: rather than keeping one global minimum set, split the hash space up into 500 buckets (for a 500-hash fingerprint) and keep the minimum for each bucket. To fill an empty bucket, take the minimum from the next non-empty bucket on the right, with rotation (a sketch follows below).
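
          A sketch of that bucketed variant (one hash function, the hash space range-partitioned into buckets, the minimum kept per bucket, empty buckets filled from the right with rotation); the bucket count and input values are illustrative, and this is not code from the papers or the patch:

          import java.util.Arrays;

          public class BucketedMinHash {

            // Range-partition the unsigned 64-bit hash space into `buckets` buckets, keep the minimum
            // hash seen in each bucket, then fill empty buckets with the minimum of the next
            // non-empty bucket to the right (wrapping around).
            static long[] fingerprint(long[] hashes, int buckets) {
              long bucketWidth = Long.divideUnsigned(-1L, buckets) + 1;   // ≈ 2^64 / buckets
              long[] mins = new long[buckets];
              boolean[] filled = new boolean[buckets];
              for (long h : hashes) {
                int b = (int) Long.divideUnsigned(h, bucketWidth);
                if (!filled[b] || Long.compareUnsigned(h, mins[b]) < 0) {
                  mins[b] = h;
                  filled[b] = true;
                }
              }
              for (int b = 0; b < buckets; b++) {
                if (filled[b]) continue;
                for (int step = 1; step < buckets; step++) {   // rotate right until a non-empty bucket
                  int source = (b + step) % buckets;
                  if (filled[source]) {
                    mins[b] = mins[source];
                    break;
                  }
                }
              }
              return mins;
            }

            public static void main(String[] args) {
              long[] hashes = {3L, -42L, 7L, Long.MAX_VALUE, 1234567890123456789L};
              System.out.println(Arrays.toString(fingerprint(hashes, 8)));
            }
          }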

          andyhind Andy Hind added a comment -

          I have attached an updated patch.

          This addresses the following issues:

          1. Support for a single hash, split into ranges with a minimum for each range
          2. Removed end-to-end tests and beefed up unit tests
          3. Removed Guava in favour of Yonik's MurmurHash implementation (some duplication here with Solr)
          4. Fixed alignment and the "evil" test case issue
          5. TestFactories passes > 200 times (some Japanese Number tokenisation failures)
          6. Fixed formatting

          There were issues applying patch 4 on its own or over the previous patch. I believe I have included everything other than the IDE-related bits.

          teofili Tommaso Teofili added a comment -

          thanks a lot Andy, it generally looks much better. I will have a look at the failures on point 5 (and eventually add the IDE fixes back).

          teofili Tommaso Teofili added a comment -

          minor fixes to the last patch; apart from that I cannot see any error, even with the Japanese locale explicitly set:

          ant clean test -Dtests.multiplier=1000000 -Dtestcase=MinHashFilterTest -Dtests.locale=ja
          

          Andy Hind, do you have the seed of a failed execution?

          andyhind Andy Hind added a comment - - edited

          Hi Tommaso, the MinHashFilterTest was running fine. It was JapaneseNumberFilter that was failing intermittently, I think on one of the evil test cases.

          LongPair should implement equals (and probably hashCode, if it will be reused) as it goes into a TreeSet. An oversight on my part (a sketch follows below).

          FWIW, as far as I can tell, the change in patch 6 was included in 5.
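
          For reference, a pair type that is safe to put into both tree- and hash-based collections looks roughly like this (a generic sketch with made-up field names, not the patch's LongPair):

          import java.util.Objects;

          final class LongPair implements Comparable<LongPair> {
            final long hash1;
            final long hash2;

            LongPair(long hash1, long hash2) {
              this.hash1 = hash1;
              this.hash2 = hash2;
            }

            @Override
            public int compareTo(LongPair other) {      // ordering used by TreeSet
              int cmp = Long.compare(hash1, other.hash1);
              return cmp != 0 ? cmp : Long.compare(hash2, other.hash2);
            }

            @Override
            public boolean equals(Object o) {           // keep equals consistent with compareTo
              if (this == o) return true;
              if (!(o instanceof LongPair)) return false;
              LongPair p = (LongPair) o;
              return hash1 == p.hash1 && hash2 == p.hash2;
            }

            @Override
            public int hashCode() {                     // needed if instances go into hash-based sets/maps
              return Objects.hash(hash1, hash2);
            }
          }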

          jira-bot ASF subversion and git services added a comment -

          Commit 82a9244193ba948142b834ec08e2de0d98cfba9f in lucene-solr's branch refs/heads/master from Tommaso Teofili
          [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=82a9244 ]

          LUCENE-6968 - MinHash filter, thanks to Andy Hind and Cao Manh Dat for patches

          teofili Tommaso Teofili added a comment -

          I've committed this, thanks to Andy Hind and Cao Manh Dat for your patches!

          andyhind Andy Hind added a comment -

          Hi Tommaso - are you planning to merge this to 6.x?

          teofili Tommaso Teofili added a comment -

          yes, I plan to merge it to 6.x. I wanted to have a few more runs on Jenkins before merging it back, to make sure there are no token-filtering-level issues.

          jira-bot ASF subversion and git services added a comment -

          Commit 82a9244193ba948142b834ec08e2de0d98cfba9f in lucene-solr's branch refs/heads/apiv2 from Tommaso Teofili
          [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=82a9244 ]

          LUCENE-6968 - MinHash filter, thanks to Andy Hind and Cao Manh Dat for patches

          jira-bot ASF subversion and git services added a comment -

          Commit 96a5595fffd4a4a1c553a987382697cb8a92b354 in lucene-solr's branch refs/heads/branch_6x from Tommaso Teofili
          [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=96a5595 ]

          LUCENE-6968 - LSH Filter backported to branch 6.x

          teofili Tommaso Teofili added a comment -

          I've backported this to branch 6.x for inclusion in the next 6.2 release.

          varunthacker Varun Thacker added a comment -

          Hi Tommaso,

          I think we need to fix the CHANGES entry and move it to the 6.2 section. It's under the 7.0 section currently.

          mikemccand Michael McCandless added a comment -

          Bulk close resolved issues after 6.2.0 release.

          yseeley@gmail.com Yonik Seeley added a comment -

          Is there a JIRA issue yet to expose (and test) this and MinHash in Solr?

          teofili Tommaso Teofili added a comment -

          Yonik Seeley, the MinHash filter can already be used in Solr just like any other TokenFilter, via its MinHashFilterFactory. Regarding the test: there used to be one testing its end-to-end usage in one of the first patches, but it was decided to keep only the BaseTokenStreamTestCase ones, to check that the filtering works correctly. Perhaps we can put the old usage test in the context of Solr and add it within the Solr tests.
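
          For instance, on the Lucene side the factory can be wired into an analysis chain with CustomAnalyzer; in Solr the same factory goes into a fieldType analyzer chain in the schema. Note that the SPI name and the filter parameters below are assumptions, so check them against the committed MinHashFilterFactory before relying on them:

          import java.io.IOException;
          import org.apache.lucene.analysis.Analyzer;
          import org.apache.lucene.analysis.custom.CustomAnalyzer;

          public class MinHashAnalyzerExample {

            static Analyzer build() throws IOException {
              return CustomAnalyzer.builder()
                  .withTokenizer("whitespace")
                  // 5-word shingles feeding the min-hash filter, as discussed earlier in the thread.
                  .addTokenFilter("shingle",
                      "minShingleSize", "5", "maxShingleSize", "5", "outputUnigrams", "false")
                  // SPI name and parameter names are assumptions; verify against MinHashFilterFactory.
                  .addTokenFilter("minHash",
                      "bucketCount", "512", "hashSetSize", "1", "withRotation", "true")
                  .build();
            }
          }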

          andyhind Andy Hind added a comment -

          There are probably some more bits required to integrate MinHash with Solr, but it depends on the use case.
          A Solr-specific ticket would make sense.

          I have not yet thought about finding likely duplicates/clustering LSH-style over the whole corpus.

          "Near duplicates of a document" would make sense and overlaps a bit with MLT. First this requires storing the fingerprint, recovering it in a distributed way and building a query. It probably makes sense to support a big OR constant-score query, a big OR constant-score query with a minimum match, and a banded query that reflects the usual LSH stuff (see the sketch below). An example was in the unit tests that were removed (it needed a bit of refinement to avoid a small bucket). Banding groups random fingerprint hashes into AND queries and ORs these together.

          I can provide the query logic.

          FYI, adding a 500-term fingerprint increased the index size by ~10%.
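
          A sketch of the banded query shape (field name and band grouping illustrative): each band becomes a conjunction of its hash terms and the bands are OR'ed, so a document matches if it agrees with the source document on every hash in at least one band:

          import java.util.List;
          import org.apache.lucene.index.Term;
          import org.apache.lucene.search.BooleanClause;
          import org.apache.lucene.search.BooleanQuery;
          import org.apache.lucene.search.ConstantScoreQuery;
          import org.apache.lucene.search.Query;
          import org.apache.lucene.search.TermQuery;

          public class BandedLshQuery {

            // bands: the source document's fingerprint hashes grouped into bands (e.g. 4 hashes per band).
            static Query bandedQuery(String field, List<List<String>> bands) {
              BooleanQuery.Builder outer = new BooleanQuery.Builder();
              for (List<String> band : bands) {
                BooleanQuery.Builder inner = new BooleanQuery.Builder();
                for (String hash : band) {
                  inner.add(new TermQuery(new Term(field, hash)), BooleanClause.Occur.MUST);  // all hashes in the band
                }
                outer.add(new ConstantScoreQuery(inner.build()), BooleanClause.Occur.SHOULD); // any band may match
              }
              return outer.build();
            }
          }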


            People

            • Assignee:
              teofili Tommaso Teofili
              Reporter:
              caomanhdat Cao Manh Dat
            • Votes:
              0
              Watchers:
              16
