Lucene - Core
LUCENE-3907

Improve the Edge/NGramTokenizer/Filters

    Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 4.4
    • Component/s: None
    • Labels:
    • Lucene Fields:
      New

      Description

      Our ngram tokenizers/filters could use some love. E.g., they output ngrams in multiple passes instead of "stacked", which messes up offsets/positions and requires too much buffering (it can hit OOME for long tokens). The tokenizers clip input at 1024 chars, but the token filters don't. They split up surrogate pairs incorrectly.
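The surrogate-pair problem can be seen with a small sketch. This is plain Python operating on UTF-16 code units, not Lucene code; Java strings are sequences of UTF-16 code units, so n-graming over raw code units behaves like this:

```python
# Hypothetical illustration (not Lucene code): a character outside the BMP,
# such as U+1F600, occupies two UTF-16 code units (a high/low surrogate pair).
term = "a\U0001F600"  # 'a' followed by one supplementary character
utf16 = term.encode("utf-16-be")
code_units = [utf16[i:i + 2] for i in range(0, len(utf16), 2)]
assert len(code_units) == 3  # 'a', high surrogate, low surrogate

# A 2-gram taken over raw code units splits the pair: the gram ends on an
# unpaired high surrogate, which is not valid UTF-16.
first_bigram = b"".join(code_units[:2])
try:
    first_bigram.decode("utf-16-be")
    splits_cleanly = True
except UnicodeDecodeError:
    splits_cleanly = False
assert not splits_cleanly
```

A correct implementation has to treat a surrogate pair as a single unit when choosing gram boundaries.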

      1. LUCENE-3907.patch
        43 kB
        Adrien Grand

        Issue Links

          Activity

          Steve Rowe made changes -
          Status Resolved [ 5 ] Closed [ 6 ]
          Steve Rowe added a comment -

          Bulk close resolved 4.4 issues

          Adrien Grand made changes -
          Status Open [ 1 ] Resolved [ 5 ]
          Resolution Fixed [ 1 ]
          Uwe Schindler made changes -
          Fix Version/s 4.4 [ 12324323 ]
          Fix Version/s 4.3 [ 12324143 ]
          Adrien Grand added a comment -

          "previous behavior" (incremented position) is simply NOT linked to front vs. back. I'm not sure why you are claiming that it is!

          Indeed these issues are unrelated, and backward n-graming doesn't cause highlighting issues. Sorry if I seemed to mean the opposite, it was not intentional.

          My main motivation was to fix the positions/offsets bugs. I also deprecated support for backward n-graming since there seemed to be lazy consensus: as Uwe noted, backward n-graming can be obtained by applying ReverseStringFilter, then EdgeNGramTokenFilter and then ReverseStringFilter again. This helps make filters simpler, hence easier to understand and to test.

          So now, here is how you would use filters depending on whether you want front or back n-graming and with or without the new positions/offsets.

                            previous positions/offsets (broken)                  new positions/offsets
          front n-graming   EdgeNGramTokenFilter(version=LUCENE_43,side=FRONT)   EdgeNGramTokenFilter(version=LUCENE_44,side=FRONT)
          back n-graming    EdgeNGramTokenFilter(version=LUCENE_43,side=BACK)    ReverseStringFilter, EdgeNGramTokenFilter(version=LUCENE_44,side=FRONT), ReverseStringFilter

          It is true that the patch prevents users from constructing EdgeNGramTokenFilter with version>=LUCENE_44 and side=BACK to encourage users to upgrade their analysis chain. But if you think we should allow for it, I'm open for discussion.
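The reverse trick behind the back n-graming row can be sketched in plain Python (illustrative only, not the Lucene API; the function names here are made up for the example): reversing the term, taking leading edge n-grams, and reversing each gram yields exactly the trailing edge n-grams.

```python
def front_edge_ngrams(term, min_gram, max_gram):
    """Leading edge n-grams, e.g. 'abcd' -> ['a', 'ab'] for 1..2."""
    return [term[:n] for n in range(min_gram, min(max_gram, len(term)) + 1)]

def back_edge_ngrams(term, min_gram, max_gram):
    """Trailing edge n-grams, e.g. 'abcd' -> ['d', 'cd'] for 1..2."""
    return [term[len(term) - n:] for n in range(min_gram, min(max_gram, len(term)) + 1)]

def back_via_reverse(term, min_gram, max_gram):
    """Emulate BACK grams: reverse, take FRONT grams, reverse each gram.
    This mirrors the ReverseStringFilter / EdgeNGramTokenFilter(side=FRONT) /
    ReverseStringFilter chain described above."""
    reversed_term = term[::-1]
    return [g[::-1] for g in front_edge_ngrams(reversed_term, min_gram, max_gram)]

assert back_via_reverse("lucene", 1, 3) == back_edge_ngrams("lucene", 1, 3)  # ['e', 'ne', 'ene']
```

This is why dropping Side.BACK loses no expressive power: the same token stream is reachable through the simpler FRONT-only filter.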

          Jack Krupansky added a comment -

          Look, the "fix" of position bugs here is to keep the position the same for all tokens, right? And that logic can simply be applied to "back" as well, for the same reasons and with the same effect. So, how could "back" - which should apply that same position logic - be a separate cause of "highlighting bugs"?

          "previous behavior" (incremented position) is simply NOT linked to front vs. back. I'm not sure why you are claiming that it is!

          The Jira record simply shows that some people "want" to eliminate a feature... not that the feature (if fixed in the same manner as the rest of the fix) "could trigger highlighting bugs" - unless I'm missing something, and if I'm missing something it is because you are not stating it clearly! So, please do so.

          Adrien Grand added a comment -

          The previous behaviour could trigger highlighting bugs so I think it is important that we fix it in 4.x. In case the broken behaviour is still needed, it can be emulated by providing Version.LUCENE_43 as the Lucene match version.

          Jack Krupansky added a comment -

          Why not make the big changes for trunk/5.0, but leave the existing filters/tokenizers in 4.x as deprecated? Add the "leading" replacements in 4.x as well, but be sure to preserve the existing support for "back" in 4.x - as deprecated.

          Adrien Grand added a comment -

          As Steve suggested, I think these tokenizers/filters need to be renamed (trunk only) since they don't support backward graming anymore. Please don't hesitate to let me know if you have a good idea for a name, otherwise I plan to rename them to "Leading..." in the next few days.

          Adrien Grand committed 1479892 (16 files)
          Reviews: none

          LUCENE-3907: Fix EdgeNGramTokenizer and EdgeNGramTokenFilter to not generate corrupt token stream graphs.

          Adrien Grand made changes -
          Assignee Uwe Schindler [ thetaphi ] Adrien Grand [ jpountz ]
          Uwe Schindler added a comment -

          Hi Adrien, thanks for the fixes. You can take the issue and assign it to yourself!

          Uwe

          Adrien Grand made changes -
          Attachment LUCENE-3907.patch [ 12581988 ]
          Adrien Grand added a comment -

          Here is a patch that fixes EdgeNGramTokenizer and EdgeNGramTokenFilter so that they pass TestRandomChains.

          It also deprecates Side.BACK so when committing, I propose to follow Steve's advice to rename them to LeadingNGramTokenizer and LeadingNGramTokenFilter in trunk.

          Adrien Grand made changes -
          Labels gsoc2012 lucene-gsoc-12 gsoc2013
          Robert Muir made changes -
          Fix Version/s 4.3 [ 12324143 ]
          Fix Version/s 4.2 [ 12323899 ]
          Steve Rowe added a comment -

          Edge is the wrong name for something that only works on one edge. Maybe rename to LeadingNgram?

          Uwe Schindler added a comment -

          Back grams would work for leading wildcards. They might be useful for things where the head is at the end (tail-first?), like domain names.

          If you need reverse n-grams, you could always add a filter to do that afterwards. There is no need to have this as separate logic in this filter. We should split logic and keep filters as simple as possible.

          Walter Underwood added a comment -

          Back grams would work for leading wildcards. They might be useful for things where the head is at the end (tail-first?), like domain names.

          Not super-useful, but it is a small part of the code in the tokenizer.

          Michael McCandless added a comment -

          I think we should remove the Side (BACK/FRONT) enum: an app can always use ReverseStringFilter if it really wants BACK grams (what are BACK grams used for?).

          Walter Underwood made changes -
          Link This issue is related to LUCENE-4810 [ LUCENE-4810 ]
          Steve Rowe made changes -
          Fix Version/s 4.2 [ 12323899 ]
          Fix Version/s 4.1 [ 12321140 ]
          Robert Muir made changes -
          Link This issue is related to LUCENE-4641 [ LUCENE-4641 ]
          Robert Muir made changes -
          Fix Version/s 4.1 [ 12321140 ]
          Fix Version/s 4.0 [ 12314025 ]
          Uwe Schindler made changes -
          Assignee Uwe Schindler [ thetaphi ]
          Uwe Schindler added a comment -

          I would like to be the mentor for this. I wanted to fix those a long time ago and I am happy somebody is helping.

          P.S.: Maybe we also get a new ShingleMatrix LOL

          Michael McCandless added a comment -

          Awesome! We just need a possible mentor here... volunteers...?

          Reinardus Surya Pradhitya added a comment -

          Hi,

          I'm interested in this project. I have done a Natural Language Processing project on language classification in which I did tokenization using Stanford's NLP tool. I'm also currently doing an Information Retrieval project on document indexing and searching using Lucene and Weka. I may not be too familiar with Lucene's ngram tokenizer, but I have worked with n-grams and Lucene before, so I believe I would be able to learn quickly. Thanks!

          Best regards,
          Reinardus Surya Pradhitya

          Robert Muir made changes -
          Field Original Value New Value
          Link This issue is related to LUCENE-3920 [ LUCENE-3920 ]
          Michael McCandless created issue -

            People

            • Assignee:
              Adrien Grand
            • Reporter:
              Michael McCandless
            • Votes:
              2
            • Watchers:
              13
