Lucene - Core
  1. Lucene - Core
  2. LUCENE-3113

fix analyzer bugs found by MockTokenizer

    Details

    • Type: Bug Bug
    • Status: Closed
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 3.2, 4.0-ALPHA
    • Component/s: modules/analysis
    • Labels:
      None
    • Lucene Fields:
      New, Patch Available

      Description

      In LUCENE-3064, we beefed up MockTokenizer with assertions, and I've switched over the analysis tests to use MockTokenizer for better coverage.

      However, this found a few bugs (one of which is LUCENE-3106):

      • incrementToken() after it returns false in CommonGramsQueryFilter, HyphenatedWordsFilter, ShingleFilter, SynonymFilter
      • missing end() implementation for PrefixAwareTokenFilter
      • double reset() in QueryAutoStopWordAnalyzer and ReusableAnalyzerBase
      • missing correctOffset()s in MockTokenizer itself.

      I think it would be nice to just fix all the bugs on one issue... I've fixed everything except Shingle and Synonym

      1. LUCENE-3113.patch
        113 kB
        Robert Muir
      2. LUCENE-3113.patch
        109 kB
        Robert Muir

        Issue Links

          Activity

          Hide
          Robert Muir added a comment -

          attached is a patch, the synonyms and shingles tests still fail.

          Show
          Robert Muir added a comment - attached is a patch, the synonyms and shingles tests still fail.
          Hide
          Robert Muir added a comment -

          updated patch, fixing the bugs in Synonyms and ShingleFilter.

          also, i found two more bugs: the ShingleAnalyzerWrapper was double-resetting, and the PrefixAndSuffixAwareTokenFilter was missing end() also

          Show
          Robert Muir added a comment - updated patch, fixing the bugs in Synonyms and ShingleFilter. also, i found two more bugs: the ShingleAnalyzerWrapper was double-resetting, and the PrefixAndSuffixAwareTokenFilter was missing end() also
          Hide
          Robert Muir added a comment -

          I think this patch is ready to commit, i'll wait and see if anyone feels like reviewing it

          Show
          Robert Muir added a comment - I think this patch is ready to commit, i'll wait and see if anyone feels like reviewing it
          Hide
          Uwe Schindler added a comment -

          A quick check on the fixes in the implementations: all fine. I was just confused about PrefixAndSuffixAwareTF, but thats fine (Robert explained it to me - this Filters are very complicated from the code/class hierarchy design g).

          I did not verify the Tests, I assume its just dumb search-replacements.

          Show
          Uwe Schindler added a comment - A quick check on the fixes in the implementations: all fine. I was just confused about PrefixAndSuffixAwareTF, but thats fine (Robert explained it to me - this Filters are very complicated from the code/class hierarchy design g ). I did not verify the Tests, I assume its just dumb search-replacements.
          Hide
          Robert Muir added a comment -

          Uwe, I think i'll open a followup issue to clean up the code about PrefixAndSuffixAwareTF. I don't like how tricky it is.

          Show
          Robert Muir added a comment - Uwe, I think i'll open a followup issue to clean up the code about PrefixAndSuffixAwareTF. I don't like how tricky it is.
          Hide
          Steve Rowe added a comment -

          +1

          the ShingleAnalyzerWrapper was double-resetting

          Your patch just removes the reset call:

          @@ -201,7 +201,6 @@
                 TokenStream result = defaultAnalyzer.reusableTokenStream(fieldName, reader);
                 if (result == streams.wrapped) {
                   /* the wrapped analyzer reused the stream */
          -        streams.shingle.reset(); 
                 } else {
                   /* the wrapped analyzer did not, create a new shingle around the new one */
                   streams.wrapped = result;
          

          but inverting the condition would read better:

                 TokenStream result = defaultAnalyzer.reusableTokenStream(fieldName, reader);
          -      if (result == streams.wrapped) {
          -        /* the wrapped analyzer reused the stream */
          -        streams.shingle.reset(); 
          -      } else {
          -        /* the wrapped analyzer did not, create a new shingle around the new one */
          +      if (result != streams.wrapped) {
          +        // The wrapped analyzer did not reuse the stream. 
          +        // Wrap the new stream with a new ShingleFilter.
                   streams.wrapped = result;
                   streams.shingle = new ShingleFilter(streams.wrapped);
                 }
          
          Show
          Steve Rowe added a comment - +1 the ShingleAnalyzerWrapper was double-resetting Your patch just removes the reset call: @@ -201,7 +201,6 @@ TokenStream result = defaultAnalyzer.reusableTokenStream(fieldName, reader); if (result == streams.wrapped) { /* the wrapped analyzer reused the stream */ - streams.shingle.reset(); } else { /* the wrapped analyzer did not, create a new shingle around the new one */ streams.wrapped = result; but inverting the condition would read better: TokenStream result = defaultAnalyzer.reusableTokenStream(fieldName, reader); - if (result == streams.wrapped) { - /* the wrapped analyzer reused the stream */ - streams.shingle.reset(); - } else { - /* the wrapped analyzer did not, create a new shingle around the new one */ + if (result != streams.wrapped) { + // The wrapped analyzer did not reuse the stream. + // Wrap the new stream with a new ShingleFilter. streams.wrapped = result; streams.shingle = new ShingleFilter(streams.wrapped); }
          Hide
          Robert Muir added a comment -

          thanks for reviewing Steven, I agree! I've made this change and will commit shortly.

          Show
          Robert Muir added a comment - thanks for reviewing Steven, I agree! I've made this change and will commit shortly.
          Hide
          Robert Muir added a comment -

          Committed revision 1104519, 1124242 (branch_3x)

          Show
          Robert Muir added a comment - Committed revision 1104519, 1124242 (branch_3x)
          Hide
          Robert Muir added a comment -

          Bulk closing for 3.2

          Show
          Robert Muir added a comment - Bulk closing for 3.2

            People

            • Assignee:
              Unassigned
              Reporter:
              Robert Muir
            • Votes:
              0 Vote for this issue
              Watchers:
              0 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development