Lucene - Core
LUCENE-5278

MockTokenizer throws away the character right after a token even if it is a valid start to a new token

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Trivial
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 4.6, 5.0
    • Component/s: None
    • Labels: None
    • Lucene Fields: New

      Description

      MockTokenizer throws away the character right after a token even if it is a valid start to a new token. You won't see this unless you build a tokenizer that can recognize every character, for example with new RegExp(".") or RegExp("...").

      Changing this behaviour seems to break a number of tests.
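The effect of the bug can be sketched with a simplified standalone model (this is not the actual MockTokenizer code, which drives a CharacterRunAutomaton; here a fixed token length n stands in for a regex of n dots, and a boolean flag models the old lookahead-discarding behavior):

```java
import java.util.ArrayList;
import java.util.List;

public class LookaheadBugSketch {

    // discardFollowingChar == true models the pre-fix behavior: the character
    // read to detect the end of a token is thrown away instead of being
    // reconsidered as the start of the next token.
    static List<String> tokenize(String input, int n, boolean discardFollowingChar) {
        List<String> tokens = new ArrayList<>();
        int i = 0;
        while (i + n <= input.length()) {
            tokens.add(input.substring(i, i + n)); // full match: automaton in an accept state
            i += n;
            if (discardFollowingChar) {
                i++; // bug: the lookahead character is lost
            }
        }
        return tokens; // any trailing partial match is dropped
    }

    public static void main(String[] args) {
        System.out.println(tokenize("abcde", 1, true));  // [a, c, e]  -- every other char lost
        System.out.println(tokenize("abcde", 1, false)); // [a, b, c, d, e]
    }
}
```

With a single-character pattern, every character should become its own token, which is why the bug only shows up with a tokenizer that can recognize every character.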

      1. LUCENE-5278.patch
        9 kB
        Robert Muir
      2. LUCENE-5278.patch
        6 kB
        Robert Muir
      3. LUCENE-5278.patch
        5 kB
        Nik Everett

        Activity

Robert Muir added a comment -

Thanks again Nik!
ASF subversion and git services added a comment -

Commit 1531498 from Robert Muir in branch 'dev/branches/branch_4x'
[ https://svn.apache.org/r1531498 ]

LUCENE-5278: remove CharTokenizer brain-damage from MockTokenizer so it works better with custom regular expressions
Robert Muir added a comment -

I committed this to trunk: I did a lot of testing locally but I want to let Jenkins have its way with it for a few hours before backporting to branch_4x.
ASF subversion and git services added a comment -

Commit 1531479 from Robert Muir in branch 'dev/trunk'
[ https://svn.apache.org/r1531479 ]

LUCENE-5278: remove CharTokenizer brain-damage from MockTokenizer so it works better with custom regular expressions
Robert Muir added a comment -

Added a few more tests to TestMockAnalyzer so all these crazy corner cases are caught there rather than while debugging other tests.
Robert Muir added a comment -

Nice patch Nik!

I think this is ready: I tweaked variable names and rearranged stuff (e.g. I use -1 instead of Integer so we aren't boxing, and a few other things).

I also added some unit tests.

The main reasons tests were failing with your original patch:

• reset() needed to clear the buffer variables.
• the state machine needed an extra check when emitting a token: e.g. if you make a regex of ".." and send it "abcde", the tokens should be "ab" and "cd", but not "e". So when we end on a partial match, we have to check that we are in an accept state.
• term-limit-exceeded is a special case (versus the last character being in a reject state).
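The accept-state point above can be sketched with java.util.regex standing in as an analogy for Lucene's automaton (MockTokenizer actually runs a CharacterRunAutomaton, so this is only an illustration of the semantics):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class AcceptStateSketch {

    // Emit the longest match starting at each position. A trailing partial
    // match (lookingAt fails) never becomes a token, because the automaton
    // would not be in an accept state. Assumes the pattern cannot match "".
    static List<String> tokenize(Pattern pattern, String input) {
        List<String> tokens = new ArrayList<>();
        Matcher m = pattern.matcher(input);
        int pos = 0;
        while (pos < input.length()) {
            m.region(pos, input.length());
            if (m.lookingAt()) {
                tokens.add(m.group());
                pos = m.end();
            } else {
                pos++; // no token starts here; skip one character
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Regex ".." on "abcde": tokens "ab" and "cd"; the trailing "e" is
        // only a partial match, so no token is emitted for it.
        System.out.println(tokenize(Pattern.compile(".."), "abcde")); // [ab, cd]
    }
}
```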
Robert Muir added a comment -

I think I understand what you want: it makes sense. The only reason it's the way it is today is that this thing historically came from CharTokenizer (see the isTokenChar?).

But it would be better if you could, e.g., make a pattern like [A-Z][a-z]+ and have it actually break FooBar into Foo and Bar rather than throwing out "bar" altogether.

I'll dig into this!
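The desired behavior can be demonstrated with java.util.regex (again only an analogy; MockTokenizer uses Lucene's RegExp/automaton classes): the character that ends one token should be able to start the next one.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class CamelCaseSketch {

    static List<String> tokenize(String regex, String input) {
        List<String> tokens = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) { // resumes exactly where the previous match ended
            tokens.add(m.group());
        }
        return tokens;
    }

    public static void main(String[] args) {
        // "Foo" ends right before 'B', and 'B' starts the next token:
        System.out.println(tokenize("[A-Z][a-z]+", "FooBar")); // [Foo, Bar]
    }
}
```

With the pre-fix lookahead-discarding behavior, the 'B' would have been thrown away after "Foo" was emitted, and only [Foo] would come out.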
Nik Everett added a comment -

This patch "fixes" the behaviour from my perspective but breaks a bunch of other tests.

          People

          • Assignee: Robert Muir
          • Reporter: Nik Everett
          • Votes: 0
          • Watchers: 2
