Solr: SOLR-908

Port of Nutch CommonGrams filter to Solr

    Details

    • Type: Wish
    • Status: Closed
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 1.4
    • Component/s: Schema and Analysis
    • Labels: None

      Description

      Phrase queries containing common words are extremely slow. We are reluctant to just use stop words due to various problems with false hits and some things becoming impossible to search with stop words turned on. (For example "to be or not to be", "the who", "man in the moon" vs "man on the moon" etc.)

      Several postings regarding slow phrase queries have suggested using the approach used by Nutch. Perhaps someone with more Java/Solr experience might take this on.

      It should be possible to port the Nutch CommonGrams code to Solr and create a suitable Solr FilterFactory so that it could be used in Solr by listing it in the Solr schema.xml.

      "Construct n-grams for frequently occurring terms and phrases while indexing. Optimize phrase queries to use the n-grams. Single terms are still indexed too, with n-grams overlaid."
      http://lucene.apache.org/nutch/apidocs-0.8.x/org/apache/nutch/analysis/CommonGrams.html
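      The quoted approach can be illustrated with a minimal sketch. This is not the Nutch or Solr code; the COMMON word set and the "_" join character are illustrative assumptions. Adjacent token pairs containing a common word are emitted as a single bigram token, while every single term is still emitted.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Minimal sketch of the common-grams idea described above. Not the actual
// Nutch/Solr implementation: the COMMON set and "_" joiner are assumptions.
public class CommonGramsSketch {
    static final Set<String> COMMON = new HashSet<>(Arrays.asList("the", "in", "on"));

    // Emit every single term, plus a joined bigram for each adjacent pair
    // that contains at least one common word (the "overlaid" n-grams).
    static List<String> commonGrams(List<String> tokens) {
        List<String> out = new ArrayList<>();
        for (int i = 0; i < tokens.size(); i++) {
            out.add(tokens.get(i)); // single term is still indexed
            if (i + 1 < tokens.size()
                    && (COMMON.contains(tokens.get(i)) || COMMON.contains(tokens.get(i + 1)))) {
                out.add(tokens.get(i) + "_" + tokens.get(i + 1)); // overlaid n-gram
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // [man, man_in, in, in_the, the, the_moon, moon]
        System.out.println(commonGrams(Arrays.asList("man", "in", "the", "moon")));
    }
}
```

      A phrase query for "man in the moon" can then match on the much rarer man_in/in_the/the_moon postings instead of walking the very long postings lists for "in" and "the", which is where the speedup comes from.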

      1. CommonGramsPort.zip
        14 kB
        Tom Burton-West
      2. SOLR-908.patch
        38 kB
        Jason Rutherglen
      3. SOLR-908.patch
        37 kB
        Jason Rutherglen
      4. SOLR-908.patch
        36 kB
        Jason Rutherglen
      5. SOLR-908.patch
        38 kB
        Jason Rutherglen
      6. SOLR-908.patch
        43 kB
        Jason Rutherglen
      7. SOLR-908.patch
        45 kB
        Jason Rutherglen
      8. SOLR-908.patch
        91 kB
        Jason Rutherglen
      9. SOLR-908.patch
        47 kB
        Tom Burton-West
      10. SOLR-908.patch
        46 kB
        Tom Burton-West

        Issue Links

          Activity

          Tom Burton-West added a comment - edited

          Attached is my first cut at a port of the Nutch CommonGrams filter to Solr. I still need to write tests for CommonGramsFilterFactory and CommonGramsQueryFilterFactory. Nutch had a method call to optimize phrase queries. For Solr I just wrote CommonGramsQueryFilter for handling queries. Preliminary tests with a relatively small* index of 100,000 full-text documents (index size 44GB, about 30% larger than the index without commongrams) indicate about 10x decrease in response times for phrase queries.

          This post by Hoss was extremely helpful as was his suggestion to use the Solr BufferedTokenStream as a base class:
          http://www.nabble.com/Re%3A-Index---search-questions--special-cases-p7344056.html

          Here is an example schema.xml entry:
          <fieldType name="ocrCommon" class="solr.TextField" positionIncrementGap="100">
            <analyzer type="index">
              <tokenizer class="solr.WhitespaceTokenizerFactory"/>
              <filter class="solr.LowerCaseFilterFactory"/>
              <filter class="solr.CommonGramsFilterFactory" words="commonwords.txt"/>
            </analyzer>
            <analyzer type="query">
              <tokenizer class="solr.WhitespaceTokenizerFactory"/>
              <filter class="solr.LowerCaseFilterFactory"/>
              <filter class="solr.CommonGramsQueryFilterFactory" words="commonwords.txt"/>
            </analyzer>
          </fieldType>

          Tom Burton-West
          University of Michigan Library
          -----------------------------------
          *We are working on indexing 1-6 million full-text documents. Our current 1 million document index is about 235GB, so in our context 100,000 docs is relatively small.

          The files in CommonGramsPort.zip are:
          CommonGramsFilter.java
          CommonGramsFilterTest.java
          CommonGramsFilterFactory.java
          CommonGramsQueryFilter.java
          CommonGramsQueryFilterFactory.java
          CommonGramsQueryFilterTest.java
          TestCommonGrams.java (non-JUnit test for input on STDIN)

          Tom Burton-West added a comment - edited

          Attached is a patch which includes the code and unit tests for CommonGramsFilterFactory and CommonGramsQueryFilterFactory, as well as the code and unit tests for CommonGramsFilter and CommonGramsQueryFilter.

          Otis Gospodnetic added a comment -

          I took a super quick look and noticed:

          • not all classes have ASL (I think unit test classes need it, too)
          • Mentions of Copyright 2009, The Regents of The University of Michigan. I have a feeling this would need to be removed
          • @author and @version. I know we remove @author lines, and I'm not sure if @version is really desired

          Looks like a very thorough and complete patch, but I haven't tried it yet.

          Tom Burton-West added a comment -

          Thanks Otis. I'll add the ASL and remove the @author and @version. I must have misunderstood the ASL regarding copyright. I'll go ahead and remove the UMich copyright statements as well. Should I also remove the TODO's in the comments?

          Tom

          Otis Gospodnetic added a comment -

          Thanks Tom. TODOs are good reminders, so I'd say leave them.

          Hoss Man added a comment -

          FWIW: There are some decently readable docs from the legal team about this topic...

          http://www.apache.org/legal/src-headers.html

          Tom Burton-West added a comment -

          Cleaned up patch.

          ASF license now included in all files.
          Umich copyright statements removed
          Removed author and revision tags
          Cleaned up code and reformatted using Solr Eclipse codestyle

          Jason Rutherglen added a comment -

          Great Tom!

          • Can we add a flag to not return the actual stop words? (I know
            we could add a StopWordFilter after, however it seems redundant?)
          • CommonGramsFilter.ArrayTokens is not used?
          Jason Rutherglen added a comment -
          • Added an includeCommon flag which when false, does not return
            common words in the token stream
          • Removed static stop words, we refer to
            StopAnalyzer.ENGLISH_STOP_WORDS
          • Needs test cases for includeCommon
          • Maybe there's more redundant code that can be removed or
            adjusted?
          • Can we change this to be a fix for 1.4?
          Jason Rutherglen added a comment -

          Sorry, patch without the patch inlined!

          Tom Burton-West added a comment -

          Hi Jason,

          Thanks for the contribution. I applied the patch to the latest Solr and the QueryFilter tests failed (see below). Perhaps something is wrong with my configuration. Did they pass for you?

          I'll dig in to this tomorrow.

          Tom.

          TEST-org.apache.solr.analysis.CommonGramsQueryFilterTest.xml:<testsuite errors="0" failures="17" hostname="ULIB-LIT0602"
          name="org.apache.solr.analysis.CommonGramsQueryFilterTest" tests="22" time="0.453" timestamp="2009-07-29T19:49:24">
          TESTS-TestSuites.xml: <testsuite errors="0" failures="17" hostname="ULIB-LIT0602" id="12" name="CommonGramsQueryFilterT
          est" package="org.apache.solr.analysis" tests="22" time="0.453" timestamp="2009-07-29T19:49:24">
          bash-3.2$

          Jason Rutherglen added a comment -

          I'll post a new patch tomorrow, essentially nothing is changing,
          just cleaned up some of the code. Adding StopFilterFactory at
          the end of the analyzer chain performed the same function as
          includeCommon=false with no performance difference.

          Tom Burton-West added a comment -

          Thanks for cleaning up the code.

          I found the problem that was causing the CommonGramsQueryFilter test to fail. CommonGramsQueryFilter checks for token type = "gram". Since the patch changed the type to "shingle" in CommonGramsFilter, the same change has to be made in CommonGramsQueryFilter.

          I'm surprised that adding StopFilterFactory to the end of the filter chain doesn't affect performance. I'll wait for your new patch before proceeding.

          BTW you asked >>Can we change this to be a fix for 1.4?
          I'd love to, but don't the committers make that decision? How do we do that?

          Tom

          Shalin Shekhar Mangar added a comment -

          BTW you asked >>Can we change this to be a fix for 1.4?
          I'd love to, but don't the committers make that decision? How do we do that?

          Considering that:

          1. The issue is old and a patch has been here in one form or another since April
          2. We have ample time before 1.4 release

          I see no reason why it can't be committed for 1.4. Otis, since you have looked at this in the past, will you take this up? Or, I can try to have a look this weekend.

          Jason Rutherglen added a comment -

          We ran an indexing performance test with CommonGramsFilter and a heavily modified/optimized filter state machine from the book Building Search Applications that does not emit stop words. I tried to change CommonGramsFilter to not emit stop words, however BufferedTokenStream seems to only move forwards.

          CommonGramsFilter was 10% faster with the StopWordsFilter at the end of the chain compared with the book's version. The book has a GNU license for its code so I cannot post it here.

          The attached patch just cleans up a few things. Perhaps more of
          the test code can be shared?

          To mark the issue as 1.4 you'd edit it, change affected version
          to 1.3, fix version to 1.4.

          Otis Gospodnetic added a comment -

          I won't get to it before going on vacation. Assigning to you if you want it.

          Shalin Shekhar Mangar added a comment -

          Perhaps more of the test code can be shared?

          Can't we use BaseTokenTestCase's helper methods?

          Jason Rutherglen added a comment -
          • Cleaned up more
          • Consolidated the test classes into one
          • Formatting in CHANGES.txt for 66 seemed off, fixed it
          • I didn't see how we could use BaseTokenTestCase
          Jason Rutherglen added a comment -

          I'll add reusableToken support to this patch.

          Jason Rutherglen added a comment -

          Tom,

          Is it our intention to support phrase queries with slop? I'm not sure this works?

          Tom Burton-West added a comment -

          Hi Jason,

          Thanks for all your work on this.

          I would be inclined to not deal with supporting slop at this point.

          I guess slop of more than 0 will work as long as the query matches the commongrams exactly. (so each commongram gets treated as one token for calculating slop). Otherwise it would seem that you would have to generate commongrams for each combination of a common word and words n edit distance away to support a slop of n. At least for our use case, the increase in the size of the index would not be worth it.

          Tom

          Jason Rutherglen added a comment -

          Tom,

          I'll add in the javadocs that if a phrase has a stop word, slop will not work.

          Jason Rutherglen added a comment -

          I'm reworking BufferedTokenStream to use the new token API. It
          seems like caching tokens goes against the way the new token api
          works. Is there an example to follow, it seems like we're
          allocating objects per gram (which is what we're trying to
          avoid?).

          Jason Rutherglen added a comment -

          There seems to be a bug in CommonGramsQueryFilterFactory where the last word of a non common word query is removed. I'm going to fix it and check in a new patch.

          Tom Burton-West added a comment -

          Hi Jason,

          If the last non-common word in a query is part of the previous common-gram then it gets removed by design.
          i.e. for query="see the hat" the query filter output should be |see-the|the-hat|

          Is this a different case? The test cases should have caught the bug. Maybe we need to add a few more test cases. Could you send me a test case that shows the bug?

          Tom
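          Tom's rule above can be captured in a small behavioral sketch. This is not the actual CommonGramsQueryFilter; the COMMON set, the "_" joiner, and the helper name are assumptions. Pairs containing a common word become grams, other words pass through, and a trailing word that was already absorbed into the previous gram is dropped.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Behavioral sketch of the query-side rule described above; not the actual
// CommonGramsQueryFilter. COMMON and the "_" joiner are assumptions.
public class CommonGramsQuerySketch {
    static final Set<String> COMMON = new HashSet<>(Arrays.asList("the", "in", "on"));

    static List<String> commonGramsQuery(List<String> tokens) {
        List<String> out = new ArrayList<>();
        boolean lastAbsorbed = false;
        for (int i = 0; i + 1 < tokens.size(); i++) {
            String cur = tokens.get(i), next = tokens.get(i + 1);
            if (COMMON.contains(cur) || COMMON.contains(next)) {
                out.add(cur + "_" + next);          // gram replaces the pair
                lastAbsorbed = (i + 1 == tokens.size() - 1);
            } else {
                out.add(cur);                        // plain word passes through
                lastAbsorbed = false;
            }
        }
        if (!tokens.isEmpty() && !lastAbsorbed) {
            out.add(tokens.get(tokens.size() - 1));  // keep a non-absorbed last word
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(commonGramsQuery(Arrays.asList("see", "the", "hat"))); // [see_the, the_hat]
        System.out.println(commonGramsQuery(Arrays.asList("red", "hat")));        // [red, hat]
    }
}
```

          The second call shows the case Jason hit: when no pair forms a gram, the last word must still be emitted rather than removed.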

          Jason Rutherglen added a comment -

          I changed: if (prev != null && prev.type() != "gram") from
          checking for "word" which when a token was created from
          StandardFilter, would be of type "<ALPHANUM>" and was being
          discarded. This was causing at least one of the bugs, though
          there may be another.
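          The shape of that fix can be sketched as follows. The Token class here is a simplified stand-in for the Lucene class, not the real API. Instead of testing for one specific term type such as "word", which breaks when StandardFilter emits "<ALPHANUM>", the check tests by value that the token is not a gram.

```java
// Sketch of the type-check fix described above. The Token class is a
// simplified stand-in for the Lucene Token API, not the real class.
public class TypeCheckSketch {
    static class Token {
        final String text;
        final String type;
        Token(String text, String type) { this.text = text; this.type = type; }
        String type() { return type; }
    }

    // A previous token should be emitted as a plain term whenever it is NOT
    // a gram. Comparing by value with equals() also avoids relying on string
    // interning, which a reference comparison like prev.type() != "gram" does.
    static boolean isPlainTerm(Token prev) {
        return prev != null && !"gram".equals(prev.type());
    }

    public static void main(String[] args) {
        System.out.println(isPlainTerm(new Token("hat", "<ALPHANUM>"))); // true
        System.out.println(isPlainTerm(new Token("the_hat", "gram")));   // false
    }
}
```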

          Jason Rutherglen added a comment -

          There is a bug that seems to be related to
          HTMLStripStandardTokenizerFactory where a single word query
          fails to generate a token using the following chain. However a
          StandardTokenizer in its place returns a token as expected.
          When SOLR-908.patch was tested with rev 799698, HTMLSSTF worked.

          <tokenizer class="solr.HTMLStripStandardTokenizerFactory"/>
          <filter class="solr.StandardFilterFactory"/>
          <filter class="solr.LowerCaseFilterFactory"/>
          <filter class="solr.EnglishPorterFilterFactory" protected="stopwords.txt"/> 
          <filter class="solr.CommonGramsQueryFilterFactory" words="stopwords.txt"/>
          <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true"/>
          
          Shalin Shekhar Mangar added a comment -

          I'll not be able to look at this before the end of the week, so un-assigning myself.

          Jason Rutherglen added a comment -

          Is there a reason this patch is unassigned for 1.4?

          Jason Rutherglen added a comment -

          This schema consistently and randomly generates query
          truncations. Perhaps because we're mixing the new and old
          tokenizing APIs? I can't figure out what state is being shared
          nor how to debug this. We unfortunately upgraded to Solr 1.4
          trunk and so cannot revert back to 1.3. I wrote a test case that
          has not reproduced the bug locally. The bug happens in a
          distributed environment with 20+ servers.

          <fieldType name="vCommonGrams" class="solr.TextField" positionIncrementGap="100">
            <analyzer type="index">
            <tokenizer class="solr.HTMLStripStandardTokenizerFactory"/>  
            <filter class="solr.StandardFilterFactory"/>
            <filter class="solr.CommonGramsFilterFactory" ignoreCase="true" words="stopwords.txt"/>
            <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true"/>
            </analyzer>
            <analyzer type="query">
            <tokenizer class="solr.StandardTokenizerFactory"/>
            <filter class="solr.StandardFilterFactory"/>
            <filter class="solr.CommonGramsQueryFilterFactory" ignoreCase="true" words="stopwords.txt"/>
            <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true"/>
            </analyzer>
          </fieldType>
          
          Jason Rutherglen added a comment -

          It looks like our problem could be due to
          Analyzer.reusableTokenStream and how it reuses tokenstreams from
          a thread local variable. This would explain the random behavior
          (i.e. depending on the thread one was assigned for a query, the
          associated token stream, if it were in an invalid state, would
          return incorrect results). I'm thinking reusableTokenStream can
          be overridden to return a new token stream each time? And so
          bypass whatever resetting issue is occurring from the mixture of
          the old and new tokenizer APIs.
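          The workaround floated here can be sketched with simplified stand-ins. The Analyzer and TokenStream classes below are not the real Lucene API, and FreshStreamAnalyzer is a hypothetical name; the idea is just to override reusableTokenStream so it never hands back a cached stream.

```java
import java.io.Reader;

// Sketch of the "always return a fresh stream" workaround described above.
// Analyzer/TokenStream are simplified stand-ins, not the Lucene classes.
public class NoReuseSketch {
    static class TokenStream {}

    static abstract class Analyzer {
        abstract TokenStream tokenStream(String field, Reader reader);

        // In real Lucene this may return a per-thread cached stream, which
        // is where stale state can leak between queries.
        TokenStream reusableTokenStream(String field, Reader reader) {
            return tokenStream(field, reader);
        }
    }

    // Hypothetical analyzer that opts out of reuse entirely.
    static class FreshStreamAnalyzer extends Analyzer {
        @Override
        TokenStream tokenStream(String field, Reader reader) {
            return new TokenStream(); // build a new filter chain per call
        }

        @Override
        TokenStream reusableTokenStream(String field, Reader reader) {
            return tokenStream(field, reader); // bypass any caching
        }
    }

    public static void main(String[] args) {
        Analyzer a = new FreshStreamAnalyzer();
        boolean distinct = a.reusableTokenStream("f", null) != a.reusableTokenStream("f", null);
        System.out.println(distinct); // true: no stream is ever shared
    }
}
```

          This trades per-query allocation cost for the guarantee that no state survives between queries.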

          Yonik Seeley added a comment -

          Jason, at a quick look, I see that this filter maintains state, but doesn't implement reset() - could that be the issue?

          Robert Muir added a comment -

          just my opinion, do not think this problem is due to mixed tokenizer APIs (LUCENE-1919)

          this is because this BufferedTokenStream does not mix the apis that cause that issue... it only uses TokenStream.next()

          i think instead Yonik might be on the right track, could be wrong.

          Uwe Schindler added a comment -

          In my opinion, the problem is BufferedTokenStream (should its name not be BufferedTokenFilter?). It has the linked list but does not implement reset(). So the problem is not this issue, but rather the usage of reset() because you reuse the token stream. As long as BufferedTokenStream is not fixed to support reset() you have to create new instances.

          Robert Muir added a comment -

          Uwe, i opened an issue for this: SOLR-1446

          i think even if not the cause of this problem, BufferedTokenStream should implement reset() since it keeps internal state.

          Robert Muir added a comment -

          similar to the BufferedTokenStream reset, the CommonGramsQueryFilter here has its own internal state:

          private Token prev;
          

          so this filter too should implement reset (and must call super.reset() so the BufferedTokenStream lists get reset too).
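          That reset() pattern can be sketched with simplified stand-ins. The classes below are not the real Solr code; the buffered queues and the observe() helper are illustrative. The key point is that the subclass clears its own state and calls super.reset() so the inherited buffers are cleared too.

```java
import java.util.LinkedList;

// Sketch of the reset() pattern described above, with simplified stand-ins
// for Solr's BufferedTokenStream and CommonGramsQueryFilter.
public class ResetSketch {
    static class BufferedTokenStream {
        protected final LinkedList<String> inputQueue = new LinkedList<>();
        protected final LinkedList<String> outputQueue = new LinkedList<>();

        public void reset() {
            // Clear buffered tokens so a reused stream starts clean.
            inputQueue.clear();
            outputQueue.clear();
        }
    }

    static class CommonGramsQueryFilter extends BufferedTokenStream {
        private String prev; // the filter's own internal state

        @Override
        public void reset() {
            super.reset(); // clears the inherited buffers
            prev = null;   // clears this filter's own state
        }

        void observe(String token) { // illustrative helper, not the real API
            prev = token;
            inputQueue.add(token);
        }

        String prev() { return prev; }
    }

    public static void main(String[] args) {
        CommonGramsQueryFilter f = new CommonGramsQueryFilter();
        f.observe("the");
        f.reset();
        System.out.println(f.prev()); // null: no state leaks into the next query
    }
}
```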

          Jason Rutherglen added a comment -

          Interesting, the whole reusableTokenStream model is new to me,
          so it wasn't in my mental view of how Lucene analyzers work. It
          seems that if BTS is caching tokens, is then reused, and isn't
          reset, there would be excess tokens rather than deletions?
          Or perhaps reset is being called from another analyzer? It's
          quite confusing. I started work on a LoggingTokenizer that could
          be inserted between tokenizers in the Solr schema; however, I
          have been focused on reproducing the issue, which hasn't worked
          either.

          Uwe, Yonik, and Robert, thanks for taking a look!

          Robert Muir added a comment -

          "It seems if BTS is caching tokens, then being reused, and isn't
          reset, then there would be excess tokens instead of deletions?"

          Right, that's what the test case I added for BufferedTokenStream showed.
          This would be more of a corner case, since I think most BufferedTokenStreams would have empty lists anyway
          by the time they are reset(), so it's likely not causing your problem (though it should be fixed!).

          Your problem, again, is probably the internal state kept in CommonGramsQueryFilter.
          As you can see, CommonGramsQueryFilter has hairy logic involving the buffered token 'prev',
          and a lot of this logic has to do with what happens at end of stream.

          Unfortunately there is no reset() for CommonGramsQueryFilter to set 'prev' back to its initial state, so when something like QueryParser tries to reuse it, it is probably not behaving correctly.
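The reuse failure mode Robert describes can be seen in miniature with an illustrative toy (hypothetical names, not the real CommonGramsQueryFilter logic): a filter whose behavior depends on a 'prev' field works on the first pass, but when reused without a reset(), it starts the second pass with leftover state and produces wrong output.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative only: a tiny filter that drops a token equal to the
// previous one. The details don't matter; what matters is that 'prev'
// survives between uses when reset() is never called.
class StatefulFilterSketch {
    private String prev = null;

    // Emits each token unless it repeats the previous one.
    List<String> run(List<String> tokens) {
        List<String> out = new ArrayList<>();
        for (String t : tokens) {
            if (!t.equals(prev)) out.add(t);
            prev = t;
        }
        return out;
    }

    void reset() { prev = null; }
}
```

Run on ["a", "a", "b"] it correctly yields ["a", "b"]; reused on ["b", "c"] without a reset, the stale prev == "b" wrongly swallows the leading "b", yielding ["c"]. After reset(), the same input yields ["b", "c"] as expected.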

          Yonik Seeley added a comment -

          I guess if something causes an exception during analysis, things like BufferedTokenStream can be left with unwanted state.
          Note that BufferedTokenStream didn't inherit from TokenFilter and thus wouldn't automatically chain reset() to its input, so any upstream filters wouldn't be reset(). I just fixed that.
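The chaining Yonik refers to can be sketched with stand-in classes (this shows the general pattern, not the actual Lucene source; all names here are invented): a filter that extends a TokenFilter-like base gets reset() propagation to its wrapped input for free, whereas a class extending the stream base directly, as BufferedTokenStream did, must forward reset() itself or its upstream filters are never reset.

```java
// Simplified stand-in for the TokenStream/TokenFilter relationship.
abstract class StreamSketch {
    private boolean wasReset = false;
    public void reset() { wasReset = true; }
    public boolean wasReset() { return wasReset; }
}

// TokenFilter-style base: wraps an input and chains reset() to it.
abstract class FilterSketch extends StreamSketch {
    protected final StreamSketch input;
    protected FilterSketch(StreamSketch input) { this.input = input; }

    @Override
    public void reset() {
        super.reset();
        input.reset();   // the chaining step BufferedTokenStream was missing
    }
}

class SourceSketch extends StreamSketch {}

class PassThroughSketch extends FilterSketch {
    PassThroughSketch(StreamSketch in) { super(in); }
}
```

Calling reset() on the outermost filter now resets the whole chain down to the source, which is what the TokenFilter inheritance buys you.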

          Jason Rutherglen added a comment -

          Added reset overrides to CommonGramsFilter and CommonGramsQueryFilter.

          Robert Muir added a comment -

          Jason, I took a glance. I think the reset() for CommonGramsQueryFilter should not set prev = null,
          because the initial state is not null:
          in the ctor, prev = new Token().
          With the current logic, this is what reset() must do as well.

          Also, FYI, CommonGramsFilter does not need a reset() since the StringBuffer isn't used to keep state.

          The best way to ensure it's correct, I think, is to add tests that consume and then reuse/reset().
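Robert's point, that reset() must restore the constructor-time initial state rather than nulling it out, can be sketched like this (illustrative stand-in names, not the real filter; a StringBuilder stands in for Lucene's Token):

```java
// Sketch of the rule: if the constructor initializes state to a
// non-null value, reset() must restore that same value, not null.
class QueryFilterResetSketch {
    private StringBuilder prev;   // stands in for 'Token prev'

    QueryFilterResetSketch() {
        prev = new StringBuilder();   // initial state: empty, but never null
    }

    public void reset() {
        prev = new StringBuilder();   // mirror the ctor; do NOT set prev = null
    }

    // downstream logic that assumes prev is never null
    boolean atStart() { return prev.length() == 0; }

    void advance(String token) {
        prev.setLength(0);
        prev.append(token);
    }
}
```

If reset() set prev = null instead, any logic that dereferences prev (as the end-of-stream handling does) would hit a NullPointerException on the first token after reuse.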

          Jason Rutherglen added a comment -

          Robert, thanks. I added the new Token in CGQF.reset() and added reset test cases.

          Yonik Seeley added a comment -

          I've updated the lucene libs to rc5 in trunk - hopefully that, coupled with the recent changes to this patch, will eliminate the flakiness.

          Jason Rutherglen added a comment -

          Yeah, unfortunately it's going to be hard to upgrade, as folks
          feel a bit burned at this point and have reverted to Solr trunk
          from 8/31 plus the old HTMLStripReader, which seems to be more
          stable than the latest Solr builds. I need to reproduce our
          wacky, crazy random query truncations, and haven't yet. I'll
          probably try creating completely randomized queries in multiple
          threads and see what happens. Without reproducing the problem
          and showing it fixed, upgrading will be difficult to justify.
          Logically the threadlocal reusableTokenStream is the problem;
          the perception, however, is that things got way too broken.

          Also, I need to upgrade the patch to use the new tokenizing API.
          I think this belongs in Lucene analyzers rather than in Solr
          anyway, and BufferedTokenStream totally changes with the new
          tokenizing API. Hacking ShingleFilter to only include certain
          words seemed like too much of a rewrite of it. So porting is the
          next task here, after hopefully reproducing the problem.

          Jason Rutherglen added a comment -

          Thanks, everyone. We deployed the latest patch with the latest Solr build, and things now work reliably.

          Yonik Seeley added a comment -

          Yay! I think we could commit this now... any objections/concerns with the latest patch?

          Shalin Shekhar Mangar added a comment -

          "Yay! I think we could commit this now... any objections/concerns with the latest patch?"

          Looks good. +1 for commit.

          Robert Muir added a comment -

          I've been doing some tests, and the latest patch has been working fine for me.
          I think Jason cleared up any remaining reset/reusable issues with this latest patch.

          Yonik Seeley added a comment -

          Committed. Thanks everyone!

          Grant Ingersoll added a comment -

          Bulk close for Solr 1.4


            People

            • Assignee: Yonik Seeley
            • Reporter: Tom Burton-West
            • Votes: 3
            • Watchers: 6
