Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 4.8, 6.0
    • Component/s: None
    • Labels:
      None
    • Lucene Fields:
      New

      Description

      ThaiWordFilter is an offender in TestRandomChains because it creates positions and updates offsets.

      1. LUCENE-4984.patch
        49 kB
        Robert Muir
      2. LUCENE-4984.patch
        32 kB
        Robert Muir
      3. LUCENE-4984.patch
        5 kB
        Adrien Grand

          Activity

          Adrien Grand added a comment -

          Patch:

          • ThaiWordFilter does not update offsets anymore,
          • and emits all tokens generated from the same input token at the same position.
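
          A minimal sketch of what "same position" could mean here, written in plain Java rather than against Lucene's attribute API. The Token class and the split method are hypothetical names used only for illustration. The assumption is that the first sub-token keeps the input token's position increment, every later sub-token gets an increment of 0, and all of them inherit the input token's offsets unchanged.

```java
import java.util.ArrayList;
import java.util.List;

public class SamePositionSketch {
    /** Hypothetical stand-in for a token's term, position increment and offsets. */
    public static final class Token {
        public final String term;
        public final int positionIncrement;
        public final int startOffset;
        public final int endOffset;

        public Token(String term, int positionIncrement, int startOffset, int endOffset) {
            this.term = term;
            this.positionIncrement = positionIncrement;
            this.startOffset = startOffset;
            this.endOffset = endOffset;
        }
    }

    /**
     * Emits one token per part. Only the first part advances the position;
     * the others are stacked on it with an increment of 0. Offsets are copied
     * from the input token and never recomputed.
     */
    public static List<Token> split(Token input, List<String> parts) {
        List<Token> out = new ArrayList<>();
        for (int i = 0; i < parts.size(); i++) {
            int posInc = (i == 0) ? input.positionIncrement : 0;
            out.add(new Token(parts.get(i), posInc, input.startOffset, input.endOffset));
        }
        return out;
    }
}
```

          With this shape, a filter that splits one input token into several never has to invent new offsets or new positions, which are the two things the description names as the problem.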
          Robert Muir added a comment -

          I think this should be a tokenizer.

          Adrien Grand added a comment -

          Good point, I'll update the patch to create a ThaiTokenizer so that we can just completely deprecate this filter.

          Robert Muir added a comment -

          Tokenizing from a BreakIterator can get a little tricky.

          We had some support for this (it should be re-reviewed) in the initial Kuromoji integration (SegmentingTokenizerBase.java and its test),
          but we ended up adding a streaming Viterbi search, so we didn't need it anymore:

          http://svn.apache.org/viewvc?view=revision&revision=1230748
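
          To illustrate the kind of segmentation being discussed, here is a rough sketch using only java.text.BreakIterator: the text is first split into sentences, then each sentence into words, and word offsets are mapped back to the full text. This is not the SegmentingTokenizerBase code from the patch. The class and method names are made up, the whole input is assumed to fit in memory (a real tokenizer reads from a Reader in chunks, which is presumably part of what makes this tricky), and the quality of Thai word breaks depends on the locale data shipped with the JRE.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

public class SentenceWordSketch {
    /** A word together with its character offsets in the original text. */
    public static final class Token {
        public final String text;
        public final int start;
        public final int end;

        Token(String text, int start, int end) {
            this.text = text;
            this.start = start;
            this.end = end;
        }
    }

    public static List<Token> tokenize(String text, Locale locale) {
        BreakIterator sentences = BreakIterator.getSentenceInstance(locale);
        BreakIterator words = BreakIterator.getWordInstance(locale);
        sentences.setText(text);

        List<Token> out = new ArrayList<>();
        int sStart = sentences.first();
        for (int sEnd = sentences.next(); sEnd != BreakIterator.DONE;
             sStart = sEnd, sEnd = sentences.next()) {
            // Segment one sentence at a time; word offsets are relative to it.
            String sentence = text.substring(sStart, sEnd);
            words.setText(sentence);
            int wStart = words.first();
            for (int wEnd = words.next(); wEnd != BreakIterator.DONE;
                 wStart = wEnd, wEnd = words.next()) {
                // The word iterator also returns whitespace and punctuation spans; skip those.
                if (hasLetterOrDigit(sentence, wStart, wEnd)) {
                    out.add(new Token(sentence.substring(wStart, wEnd),
                                      sStart + wStart, sStart + wEnd));
                }
            }
        }
        return out;
    }

    private static boolean hasLetterOrDigit(String s, int start, int end) {
        for (int i = start; i < end; ) {
            int cp = s.codePointAt(i);
            if (Character.isLetterOrDigit(cp)) {
                return true;
            }
            i += Character.charCount(cp);
        }
        return false;
    }
}
```

          Calling SentenceWordSketch.tokenize(text, new Locale("th")) would be the Thai case; splitting by sentence first keeps the text handed to the word iterator small.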

          Robert Muir added a comment -

          I cut this over to ThaiTokenizer with that base class restored from Kuromoji. The tokenizer itself is simpler now. I think we can use the same approach with SmartChinese.

          Robert Muir added a comment -

          Updated patch: I also cut over SmartChinese to use this same approach while we are here.

          Ryan Ernst added a comment -

          +1, patch lgtm

          Is fixing Smart Chinese to not emit punctuation as simple as hardcoding the list of punctuation characters and skipping them in something like incrementWord()?

          Robert Muir added a comment -

          It's even simpler than that, but I wanted to do that in a follow-up issue. 4.8 is a good time to fix it, as it's easy with this tokenizer!

          Simon Willnauer added a comment -

          I really like the base class! The patch LGTM, +1 to commit.

          ASF subversion and git services added a comment -

          Commit 1579846 from Robert Muir in branch 'dev/trunk'
          [ https://svn.apache.org/r1579846 ]

          LUCENE-4984: Fix ThaiWordFilter, smartcn WordTokenFilter

          ASF subversion and git services added a comment -

          Commit 1579853 from Robert Muir in branch 'dev/trunk'
          [ https://svn.apache.org/r1579853 ]

          LUCENE-4984: actually pass down the AttributeFactory to superclass

          ASF subversion and git services added a comment -

          Commit 1579855 from Robert Muir in branch 'dev/branches/branch_4x'
          [ https://svn.apache.org/r1579855 ]

          LUCENE-4984: Fix ThaiWordFilter, smartcn WordTokenFilter

          Uwe Schindler added a comment -

          Close issue after release of 4.8.0


            People

            • Assignee:
              Adrien Grand
              Reporter:
              Adrien Grand
            • Votes:
              0
              Watchers:
              6
