Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 4.4, 6.0
    • Component/s: None
    • Labels:
      None
    • Lucene Fields:
      New

      Description

      Now that we fixed NGramTokenizer and NGramTokenFilter to not produce corrupt token streams, the only way to have "true" offsets for n-grams is to use the tokenizer (the filter emits the offsets of the original token).

      Yet, our NGramTokenizer has a few flaws, in particular:

      • it doesn't have the ability to pre-tokenize the input stream, for example on whitespaces,
      • it doesn't play nice with surrogate pairs.

      Since we already broke backward compatibility for it in 4.4, I'd like to also fix these issues before we release.
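The surrogate-pair flaw can be shown with a small standalone demo (not Lucene code; the class and method names are made up for illustration). A char-based bigram loop cuts the UTF-16 encoding of U+1D11E in half, while advancing by code points keeps every gram well-formed:

```java
import java.util.ArrayList;
import java.util.List;

// "a𝄞b" is 3 code points but 4 Java chars, because U+1D11E is encoded
// as the surrogate pair \uD834\uDD1E in UTF-16.
public class SurrogateDemo {
    // Naive char-based bigrams: may cut a surrogate pair in half.
    static List<String> charBigrams(String s) {
        List<String> grams = new ArrayList<>();
        for (int i = 0; i + 2 <= s.length(); i++) {
            grams.add(s.substring(i, i + 2));
        }
        return grams;
    }

    // Code-point-based bigrams: advance by Character.charCount(cp).
    static List<String> codePointBigrams(String s) {
        List<String> grams = new ArrayList<>();
        for (int start = 0; start < s.length();
                start += Character.charCount(s.codePointAt(start))) {
            int mid = start + Character.charCount(s.codePointAt(start));
            if (mid >= s.length()) break;
            int end = mid + Character.charCount(s.codePointAt(mid));
            grams.add(s.substring(start, end));
        }
        return grams;
    }

    public static void main(String[] args) {
        String s = "a\uD834\uDD1Eb"; // "a𝄞b"
        System.out.println(charBigrams(s));      // 3 grams, two contain lone surrogates
        System.out.println(codePointBigrams(s)); // 2 grams, both well-formed
    }
}
```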

      1. LUCENE-5042.patch
        66 kB
        Adrien Grand
      2. LUCENE-5042.patch
        16 kB
        Adrien Grand

        Activity

        Adrien Grand added a comment -

        Patch:

        • Computes n-grams based on unicode code points instead of java chars
        • Adds the ability to split the input stream on some chars like CharTokenizer
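The code-point conversion the patch performs could be sketched as follows; this is a hypothetical illustration, not the patch's actual code, and the toCodePoints name is assumed. It converts a char[] region into an int[] of Unicode code points using the bounded Character.codePointAt(char[], int, int):

```java
import java.util.Arrays;

public class CharacterUtilsSketch {
    // Convert src[off..off+len) to an array of Unicode code points.
    static int[] toCodePoints(char[] src, int off, int len) {
        int[] dest = new int[len]; // upper bound; the result may be shorter
        int count = 0;
        for (int i = off; i < off + len; ) {
            // Bounded variant: never reads past off + len.
            int cp = Character.codePointAt(src, i, off + len);
            dest[count++] = cp;
            i += Character.charCount(cp); // 1 for BMP, 2 for supplementary
        }
        return Arrays.copyOf(dest, count);
    }
}
```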
        Simon Willnauer added a comment -

        hey adrien, this looks very cool! I have a couple of minor comments:

        • can we factor out the toCodepoints calculation into a method, for instance in CharacterUtils? I think we use this elsewhere in a similar way, and you might want to reuse it in the future as well.
        • can we have a comment on NGramTokenizer that every method should be final except for isTokenChar?
        • if you can think of a hard limit for the while(true) loop in NGramTokenizer, can we add an assert that makes sure we always make progress, i.e. never walk backwards or consume nothing? Not sure if it is possible.
        • can you use more parentheses for readability, like in:
        if (gramSize > maxGram || bufferStart + gramSize > bufferEnd) 
        // vs.
        
        if (gramSize > maxGram || (bufferStart + gramSize) > bufferEnd) 
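The progress assertion suggested above can be sketched in a toy gram emitter (hypothetical code, not NGramTokenizer itself): inside the unbounded while(true) loop, remember the position at the top of the iteration and assert it strictly advanced before looping again.

```java
public class ProgressCheckedGrams {
    private int pos = 0; // next code point to start a gram at

    /** Emits the next bigram over input, skipping space code points. */
    String nextBigram(int[] input) {
        while (true) {
            int before = pos; // position at the top of this iteration
            if (pos + 2 > input.length) {
                return null; // exhausted
            }
            if (input[pos] == ' ' || input[pos + 1] == ' ') {
                pos++; // skip grams that would cover a separator
                // The assert Simon asks for: we must never loop without
                // moving forward, or the loop would spin forever.
                assert pos > before : "loop made no progress";
                continue;
            }
            String gram = new String(input, pos, 2); // 2 code points
            pos++;
            return gram;
        }
    }
}
```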
        
        Adrien Grand added a comment -

        Thanks for the review Simon, here is a new patch that should address your concerns. Additionally:

        • it also fixes the other (edge) n-gram tokenizers and filters
        • I factored out some methods into CharacterUtils
        • I ran into a bug because Character.codePointAt(char[], int) doesn't know where the char[] ends, which can be a problem when working with buffers that are not completely filled. So I made this API forbidden and fixed other places that relied on it: codePointAt(char[], int, int) looks safer to me.
        • I changed the CharacterUtils.fill API so that it reads fully (which it didn't do, although the documentation stated it does).
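The buffer bug described above is easy to reproduce in isolation. Without a limit, Character.codePointAt(char[], int) has no notion of where the valid data ends, so a stale low surrogate left over from a previous fill gets paired with a trailing high surrogate:

```java
public class CodePointAtDemo {
    public static void main(String[] args) {
        char[] buf = new char[8];
        buf[0] = '\uD834'; // high surrogate: last char of the *valid* region
        buf[1] = '\uDD1E'; // stale low surrogate from a previous buffer fill
        int valid = 1;     // only buf[0..valid) holds fresh data

        // Unbounded variant: pairs the stale char -> bogus U+1D11E.
        System.out.println(Character.codePointAt(buf, 0) == 0x1D11E);

        // Bounded variant: respects the limit and returns the lone
        // high surrogate, so the caller knows to refill the buffer.
        System.out.println(Character.codePointAt(buf, 0, valid) == 0xD834);
    }
}
```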
        Simon Willnauer added a comment -

        Wow! This looks very cool! I wonder if we should rename the CharacterUtils classes to UTF32CharUtils and UTF16CharUtils rather than Java5 and Java4; I think that makes more sense in terms of how we use them today?

        Otherwise I am +1 to commit! good stuff Adrien!

        Simon Willnauer added a comment -

        I opened LUCENE-5054 for this

        Commit Tag Bot added a comment -

        [trunk commit] jpountz
        http://svn.apache.org/viewvc?view=revision&revision=1492185

        LUCENE-5042: Fix the n-gram tokenizers and filters.

        This commit fixes n-gram tokenizers and filters so that they handle
        supplementary characters correctly and adds the ability to pre-tokenize the
        stream in tokenizers.
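The pre-tokenization the commit adds can be sketched standalone (this is not Lucene's actual API; the class and method names below are assumptions): an overridable isTokenChar(int) decides which code points belong to tokens, and n-grams are generated inside each token only, so grams never straddle a whitespace boundary.

```java
import java.util.ArrayList;
import java.util.List;

class WhitespaceNGrams {
    // Subclasses override this to change the pre-tokenization rule,
    // mirroring the "every method final except isTokenChar" idea.
    protected boolean isTokenChar(int codePoint) {
        return !Character.isWhitespace(codePoint); // split on whitespace
    }

    List<String> bigrams(String text) {
        List<String> grams = new ArrayList<>();
        StringBuilder token = new StringBuilder();
        for (int i = 0; i <= text.length(); ) {
            // Pretend there is a separator past the end to flush the last token.
            int cp = i < text.length() ? text.codePointAt(i) : ' ';
            if (i < text.length() && isTokenChar(cp)) {
                token.appendCodePoint(cp);
            } else {
                emitBigrams(token, grams); // gram boundaries stop at separators
                token.setLength(0);
            }
            i += Character.charCount(cp);
        }
        return grams;
    }

    private void emitBigrams(StringBuilder token, List<String> grams) {
        int[] cps = token.codePoints().toArray(); // gram over code points
        for (int j = 0; j + 2 <= cps.length; j++) {
            grams.add(new String(cps, j, 2));
        }
    }
}
```

With this rule, an input like "ab cd" yields only grams within each word, never a gram containing the space.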

        Commit Tag Bot added a comment -

        [branch_4x commit] jpountz
        http://svn.apache.org/viewvc?view=revision&revision=1492231

        LUCENE-5042: Fix the n-gram tokenizers and filters.

        Commit Tag Bot added a comment -

        [trunk commit] jpountz
        http://svn.apache.org/viewvc?view=revision&revision=1492257

        LUCENE-5042: Reset the CharBuffer in TokenStream.reset().

        Commit Tag Bot added a comment -

        [branch_4x commit] jpountz
        http://svn.apache.org/viewvc?view=revision&revision=1492259

        LUCENE-5042: Reset the CharBuffer in TokenStream.reset().

        Commit Tag Bot added a comment -

        [trunk commit] sarowe
        http://svn.apache.org/viewvc?view=revision&revision=1492420

        LUCENE-5042: Maven configuration: add chars.txt to forbiddenapis config

        Commit Tag Bot added a comment -

        [branch_4x commit] sarowe
        http://svn.apache.org/viewvc?view=revision&revision=1492422

        LUCENE-5042: Maven configuration: add chars.txt to forbiddenapis config (merged trunk r1492420)

        Commit Tag Bot added a comment -

        [trunk commit] jpountz
        http://svn.apache.org/viewvc?view=revision&revision=1492543

        LUCENE-5042: Refuse to convert if the length is negative.

        Commit Tag Bot added a comment -

        [branch_4x commit] jpountz
        http://svn.apache.org/viewvc?view=revision&revision=1492549

        LUCENE-5042: Refuse to convert if the length is negative.

        Steve Rowe added a comment -

        Bulk close resolved 4.4 issues


          People

          • Assignee:
            Adrien Grand
            Reporter:
            Adrien Grand
          • Votes:
            0
            Watchers:
            2

            Dates

            • Created:
              Updated:
              Resolved:

              Development