Lucene - Core
LUCENE-4810

Positions are incremented for each ngram in EdgeNGramTokenFilter

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 4.3, 6.0
    • Component/s: modules/analysis
    • Labels: None
    • Lucene Fields: New

      Description

      Edge ngrams should behave like synonyms, with all the ngrams generated from a token sharing the position of that original token. The current code increments the position for each ngram.

      For the text "molecular biology", the query "mol bio" should match as a phrase in neighboring positions. It does not.

      You can see this in the Analysis page in the admin UI.
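
      As a concrete check of the expected behaviour, here is a minimal sketch (assuming Lucene 6.x-era analysis APIs; constructor signatures differ in other versions, and the 3/5 gram sizes are arbitrary) that prints each emitted gram with its resolved position. After this fix, all grams of "molecular" share position 0 and all grams of "biology" share position 1, so the phrase "mol bio" matches in adjacent positions.

      import java.io.StringReader;

      import org.apache.lucene.analysis.TokenStream;
      import org.apache.lucene.analysis.Tokenizer;
      import org.apache.lucene.analysis.core.WhitespaceTokenizer;
      import org.apache.lucene.analysis.ngram.EdgeNGramTokenFilter;
      import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
      import org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute;

      public class EdgeNGramPositionsDemo {
        public static void main(String[] args) throws Exception {
          Tokenizer tokenizer = new WhitespaceTokenizer();
          tokenizer.setReader(new StringReader("molecular biology"));
          TokenStream ts = new EdgeNGramTokenFilter(tokenizer, 3, 5);

          CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
          PositionIncrementAttribute posIncr = ts.addAttribute(PositionIncrementAttribute.class);

          ts.reset();
          int position = -1;
          while (ts.incrementToken()) {
            position += posIncr.getPositionIncrement();
            // Expected after the fix: mol/mole/molec at position 0, bio/biol/biolo at position 1.
            System.out.println(term + " @ " + position);
          }
          ts.end();
          ts.close();
        }
      }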

      Attachments

      1. LUCENE-4810.diff
        2 kB
        Walter Underwood
      2. LUCENE-4810.patch
        6 kB
        Michael McCandless
      3. LUCENE-4810.patch
        4 kB
        Michael McCandless
      4. LUCENE-4810-first-token-position-increment.patch
        4 kB
        Steve Rowe

          Activity

          Walter Underwood added a comment -

          Patch based on the 4.x source. The filenames are a bit odd because I was developing on 3.3.0.

          Robert Muir added a comment -

          At a glance I think I like the idea myself. I don't like tokenfilters that 'retokenize' by changing up the positions; I think it causes all kinds of havoc.

          Instead, this patch simplifies what the filter is doing conceptually: for each word, all of its prefixes are added as synonyms.
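
          To make the "prefixes as synonyms" idea concrete, here is a toy, hypothetical filter (it is not the real EdgeNGramTokenFilter; it ignores offsets, drops tokens shorter than the minimum gram size, and hard-codes a minimum prefix length of 3) that emits every prefix of each input token, carrying the original position increment on the first prefix and an increment of 0 on the rest.

          import java.io.IOException;

          import org.apache.lucene.analysis.TokenFilter;
          import org.apache.lucene.analysis.TokenStream;
          import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
          import org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute;

          public final class ToyPrefixSynonymFilter extends TokenFilter {
            private static final int MIN_GRAM = 3;  // arbitrary choice for this sketch

            private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
            private final PositionIncrementAttribute posIncrAtt =
                addAttribute(PositionIncrementAttribute.class);

            private char[] curTerm;    // token currently being expanded, null if none
            private int curGramSize;   // length of the next prefix to emit
            private int savedPosIncr;  // increment restored on the first prefix

            public ToyPrefixSynonymFilter(TokenStream input) {
              super(input);
            }

            @Override
            public boolean incrementToken() throws IOException {
              while (true) {
                if (curTerm == null) {
                  if (!input.incrementToken()) {
                    return false;
                  }
                  curTerm = new char[termAtt.length()];
                  System.arraycopy(termAtt.buffer(), 0, curTerm, 0, termAtt.length());
                  curGramSize = MIN_GRAM;
                  savedPosIncr = posIncrAtt.getPositionIncrement();
                }
                if (curGramSize <= curTerm.length) {
                  clearAttributes();
                  termAtt.copyBuffer(curTerm, 0, curGramSize);
                  // First prefix keeps the original increment; later prefixes stack on it.
                  posIncrAtt.setPositionIncrement(curGramSize == MIN_GRAM ? savedPosIncr : 0);
                  curGramSize++;
                  return true;
                }
                curTerm = null;  // done with this token; pull the next one
                // (Tokens shorter than MIN_GRAM are silently dropped in this sketch.)
              }
            }

            @Override
            public void reset() throws IOException {
              super.reset();
              curTerm = null;
            }
          }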

          David Smiley added a comment -

          +1

          Michael McCandless added a comment -

          Patch looks good, thanks Walter!

          I added a CHANGES entry and a test case ... I think it's ready.

          Walter Underwood added a comment -

          Thanks for the test case. I noticed that the change didn't cause a test failure. Oops.

          Michael McCandless added a comment -

          I noticed that the change didn't cause a test failure.

          Yeah, spooky

          Michael McCandless added a comment -

          New patch, also fixing EdgeNGramTokenizer to be consistent (set posInc=0 for all but the first gram).

          Commit Tag Bot added a comment -

          [branch_4x commit] Michael McCandless
          http://svn.apache.org/viewvc?view=revision&revision=1453934

          LUCENE-4810: don't increment position on every gram (only the first, for a given input token) in EdgeNGramTokenizer/Filter

          Michael McCandless added a comment -

          Thanks Walter!

          Commit Tag Bot added a comment -

          [trunk commit] Michael McCandless
          http://svn.apache.org/viewvc?view=revision&revision=1453937

          LUCENE-4810: don't increment position on every gram (only the first, for a given input token) in EdgeNGramTokenizer/Filter

          Steve Rowe added a comment -

          On the 4.3 vote thread, Karol Sikora reported seeing an exception when EdgeNGramTokenFilter followed the Morfologik lemmatizing filter. My summary of the new EdgeNGramTokenFilter bug is on the thread here: http://markmail.org/message/7dkd6edz2p7fat2h.

          As of the changes committed on this issue, EdgeNGramTokenFilter passes through as-is the position increment on its first emitted token for a given input token. When the input token has a position increment of 0, e.g. for synonyms, and this is the first output token, EdgeNGramTokenFilter is then guilty of producing a stream whose first token has a position increment of 0.
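
          The fixup this implies can be illustrated with a small, hypothetical standalone filter (not the committed patch, which applies the guard inside EdgeNGramTokenFilter itself): whatever increments the upstream stream produced, the first token a stream emits must have a position increment of at least 1.

          import java.io.IOException;

          import org.apache.lucene.analysis.TokenFilter;
          import org.apache.lucene.analysis.TokenStream;
          import org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute;

          public final class FirstTokenPosIncFixupFilter extends TokenFilter {
            private final PositionIncrementAttribute posIncrAtt =
                addAttribute(PositionIncrementAttribute.class);
            private boolean first = true;

            public FirstTokenPosIncFixupFilter(TokenStream input) {
              super(input);
            }

            @Override
            public boolean incrementToken() throws IOException {
              if (!input.incrementToken()) {
                return false;
              }
              if (first && posIncrAtt.getPositionIncrement() == 0) {
                // A token stream must not start with a position increment of 0.
                posIncrAtt.setPositionIncrement(1);
              }
              first = false;
              return true;
            }

            @Override
            public void reset() throws IOException {
              super.reset();
              first = true;
            }
          }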

          Steve Rowe added a comment -

          Patch fixing the first-token-zero-position-increment problem, along with a test (using PositionFilter) that fails without the patch and succeeds with it.

          Steve Rowe added a comment -

          I see FilteringTokenFilter has a similar first token pos incr fixup, so this problem is handled in at least one other filter. I'll take a broader look tomorrow.

          Michael McCandless added a comment -

          Ugh. Nice catch, Karol and Steve.

          TestRandomChains would have caught this ... we really need to fix the issues (e.g. LUCENE-3920) in Edge/NGramTokenizer/Filter so they aren't excluded from the TestRandomChains rotation ...

          Commit Tag Bot added a comment -

          [trunk commit] sarowe
          http://svn.apache.org/viewvc?view=revision&revision=1470496

          LUCENE-4810: first output token from EdgeNGramTokenFilter must be > 0

          Commit Tag Bot added a comment -

          [branch_4x commit] sarowe
          http://svn.apache.org/viewvc?view=revision&revision=1470497

          LUCENE-4810: first output token from EdgeNGramTokenFilter must be > 0 (merged trunk r1470496)

          Commit Tag Bot added a comment -

          [lucene_solr_4_3 commit] sarowe
          http://svn.apache.org/viewvc?view=revision&revision=1470502

          LUCENE-4810: first output token from EdgeNGramTokenFilter must be > 0 (merged trunk r1470496)

          Steve Rowe added a comment -

          FWIW, I fixed the svn:log property on the 1470496, 1470497, and 1470502 revisions to be "LUCENE-4810: position increment for first output token from EdgeNGramTokenFilter must be > 0".

          Steve Rowe added a comment -

          Committed fix to ensure first token position increment > 0, to trunk, branch_4x, and lucene_solr_4_3.

          Michael McCandless added a comment -

          Thanks Steve!

          Uwe Schindler added a comment -

          Closed after release.


            People

            • Assignee: Michael McCandless
            • Reporter: Walter Underwood
            • Votes: 0
            • Watchers: 4
