Lucene - Core
LUCENE-4955

NGramTokenFilter increments positions for each gram

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 4.3
    • Fix Version/s: 4.4, 6.0
    • Component/s: modules/analysis
    • Labels:
      None
    • Lucene Fields:
      New

      Description

      NGramTokenFilter increments positions for each gram rather than for the actual token, which can lead to rather odd problems, especially with highlighting. Whether this filter should be used for highlighting is a different story, but today it seems to be common practice in many situations to highlight sub-term matches.

      I have a highlighting test that uses n-grams and fails with a StringIndexOutOfBoundsException, since tokens are sorted by position, which causes offsets to be mixed up by the n-gram token filter.
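The failure mode can be sketched outside of Lucene. Below is a minimal Python illustration (hypothetical token tuples, not the actual Lucene API) of how per-gram position increments, combined with the old size-grouped emission order, make start offsets jump backwards as positions increase, which is exactly what trips up a highlighter that sorts tokens by position:

```python
# Tokens are modeled as (term, position, startOffset, endOffset) tuples.
# This mimics the pre-fix behaviour: grams are emitted grouped by gram
# size, and every gram bumps the position by one.

def buggy_grams(term, start, min_gram, max_gram):
    out, pos = [], 0
    for n in range(min_gram, max_gram + 1):  # old order: smaller sizes first
        for i in range(len(term) - n + 1):
            pos += 1                         # bug: one position per gram
            out.append((term[i:i + n], pos, start + i, start + i + n))
    return out

grams = buggy_grams("quick", 0, 1, 2)
starts = [g[2] for g in sorted(grams, key=lambda g: g[1])]
print(starts)  # start offsets go backwards: [0, 1, 2, 3, 4, 0, 1, 2, 3]
```

Sorting these tokens by position leaves the offsets out of order (they reset to 0 when the gram size increases), so any consumer that assumes offsets move forward with positions can index past the end of a highlighted fragment.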

      1. highlighter-test.patch
        4 kB
        Simon Willnauer
      2. highlighter-test.patch
        3 kB
        Simon Willnauer
      3. LUCENE-4955.patch
        42 kB
        Adrien Grand
      4. LUCENE-4955.patch
        12 kB
        Simon Willnauer


          Activity

          Simon Willnauer added a comment -

           Here is a patch including the deprecation and version support (I bet somebody relies on this behaviour). For trunk I will remove the deprecated methods, but I added them here to make sure I don't miss it.

           The second patch is a test for the highlighter that produces a StringIndexOutOfBoundsException without the patch but succeeds with the positions patch.

          Jack Krupansky added a comment -

          I think that ngram filter and edge-ngram filter are rather different cases.

           With edge-ngram it is abundantly clear that all of the edge n-grams "stack up" at the same position (at least for "front" edge n-grams!). But embedded n-grams seem more like a stretching out of the token, from one token into a sequence of tokens. Actually, it is k overlaid sequences, where k = maxGramSize - minGramSize + 1.

           I think the solution should be to have a "mode" which indicates whether the "intent" is merely variations (sub-tokens) of the token at the same position vs. stretching the token into a sequence of tokens. Maybe call it "expansionMode": "stack" vs. "sequence".

          But even for the latter, I would definitely recommend that each of the k sequences should restart the position at the original token position.
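The "k overlaid sequences" view above can be sketched in a few lines of Python (an illustration of the counting argument, not Lucene code):

```python
# For each gram size n there is one left-to-right sequence of n-grams,
# so a token yields k = maxGramSize - minGramSize + 1 overlaid sequences.

def overlaid_sequences(term, min_gram, max_gram):
    return {n: [term[i:i + n] for i in range(len(term) - n + 1)]
            for n in range(min_gram, max_gram + 1)}

seqs = overlaid_sequences("shingle", 2, 4)
print(len(seqs))  # k = 4 - 2 + 1 = 3 overlaid sequences
print(seqs[2])    # ['sh', 'hi', 'in', 'ng', 'gl', 'le']
```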

          Adrien Grand added a comment -

           Given that offsets can't go backwards and that tokens in the same position must have the same start offset, I think the only way to get NGramTokenFilter out of TestRandomChains' exclusion list (LUCENE-4641) is to fix position increments (this issue), change the order tokens are emitted in (LUCENE-3920), and stop modifying offsets. I know some people rely on the current behavior, but I think it's more important to get this filter out of TestRandomChains' exclusions, since it causes highlighting bugs and makes the term vectors files unnecessarily larger.

          Robert Muir added a comment -

           +1 Adrien. These analysis components should either be fixed or removed.

           We can speed up the process now by changing IndexWriter to reject this kind of bogus data. We shouldn't be putting broken data into e.g. term vectors. That should encourage the fixing process.

          Simon Willnauer added a comment -

           We can speed up the process now by changing IndexWriter to reject this kind of bogus data. We shouldn't be putting broken data into e.g. term vectors. That should encourage the fixing process.

          +1

          I updated the highlighter test and added analysis-common as a test dependency such that this can be run with ant.

          Adrien Grand added a comment -

          +1

          I'll work on fixing NGramTokenizer and NGramTokenFilter.

          Robert Muir added a comment -

           I don't think we should add analysis-common as a test dependency to the highlighter. I worked pretty hard to clean all this up with e.g. MockTokenizer so we didn't have dependency hell. It also keeps our tests clean.

          Simon Willnauer added a comment -

           Robert, I agree. I added this as a separate patch to make sure that, whatever we commit here, we can at least test that the n-gram filter doesn't throw an IOOB anymore. I just wanted to make it easier to run the test.

          Adrien Grand added a comment -

          I tried to iterate on Simon's patch:

           • NGramTokenFilter doesn't modify offsets and emits all n-grams of a single term at the same position.
          • NGramTokenizer uses a sliding window.
          • NGramTokenizer and NGramTokenFilter removed from TestRandomChains exclusions.

          It was very hard to add the compatibility version support to NGramTokenizer so there are now two distinct classes and the factory picks the right one depending on the Lucene match version.

          Simon's highlighting test now fails because the highlighted content is different, but not because of a broken token stream.
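The two fixed behaviours described above can be sketched as follows (a conceptual Python illustration of the described semantics, not the actual patch):

```python
# Post-fix NGramTokenizer-style emission: slide a window over the raw
# text, emitting every gram size at each start offset before advancing,
# so start offsets never go backwards.
def sliding_window_grams(text, min_gram, max_gram):
    return [(text[i:i + n], i, i + n)
            for i in range(len(text))
            for n in range(min_gram, max_gram + 1)
            if i + n <= len(text)]

# Post-fix NGramTokenFilter-style emission: all grams of one term stack
# at the term's position and keep its original offsets untouched.
def stacked_grams(term, position, start, end, min_gram, max_gram):
    return [(term[i:i + n], position, start, end)
            for n in range(min_gram, max_gram + 1)
            for i in range(len(term) - n + 1)]

toks = sliding_window_grams("fox", 1, 2)
print([t[1] for t in toks])  # start offsets non-decreasing: [0, 0, 1, 1, 2]
```

Because the tokenizer's start offsets are non-decreasing and the filter never rewrites a term's offsets, both components satisfy the invariants TestRandomChains checks.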

          Commit Tag Bot added a comment -

          [trunk commit] jpountz
          http://svn.apache.org/viewvc?view=revision&revision=1476135

          LUCENE-4955: Fix NGramTokenizer and NGramTokenFilter, and remove them from TestRandomChains' exclusion list.

          Commit Tag Bot added a comment -

          [branch_4x commit] jpountz
          http://svn.apache.org/viewvc?view=revision&revision=1476159

          LUCENE-4955: Fix NGramTokenizer and NGramTokenFilter, and remove them from TestRandomChains' exclusion list (merged from r1476135).

          In addition to the trunk changes, I had to fix SlowSynonymFilterFactory to reset the token stream before consuming it.

          Steve Rowe added a comment -

          If there are no objections, I'd like to backport this to 4.3.1.

          Simon Willnauer added a comment -

           Steve, I don't think we should do this unless we add a Version.LUCENE_4_3_1. The problem is that this filter would introduce totally different behaviour compared to the 4.3 version that we released. If you built your index with 4.3, you would need to reindex when switching to 4.3.1, unless we add the version so we can fall back on the old behaviour. Makes sense?

          Steve Rowe added a comment -

           Steve, I don't think we should do this unless we add a Version.LUCENE_4_3_1. The problem is that this filter would introduce totally different behaviour compared to the 4.3 version that we released. If you built your index with 4.3, you would need to reindex when switching to 4.3.1, unless we add the version so we can fall back on the old behaviour. Makes sense?

          I agree, Simon. I'll remove the lucene-4.3.1-candidate label.

          Steve Rowe added a comment -

          Bulk close resolved 4.4 issues


            People

            • Assignee:
              Adrien Grand
              Reporter:
              Simon Willnauer
             • Votes: 0
             • Watchers: 3
