Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 4.8, 6.0
    • Component/s: None
    • Labels: None
    • Lucene Fields: New

      Description

      ThaiWordFilter is an offender in TestRandomChains because it creates positions and updates offsets.

      1. LUCENE-4984.patch
        49 kB
        Robert Muir
      2. LUCENE-4984.patch
        32 kB
        Robert Muir
      3. LUCENE-4984.patch
        5 kB
        Adrien Grand

          Activity

          jpountz Adrien Grand added a comment -

          Patch:

          • ThaiWordFilter does not update offsets anymore,
          • and emits all tokens generated from the same input token at the same position.
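The two changes in this patch can be illustrated with a minimal sketch in plain Java. This is not the Lucene code from the patch: `Token` and `split` are hypothetical names invented here, and the only Lucene convention assumed is that a position increment of 0 places a token at the same position as the previous one.

```java
import java.util.ArrayList;
import java.util.List;

class SamePositionSketch {
    /** Hypothetical token model: term text, position increment, start/end offsets. */
    static final class Token {
        final String term;
        final int posInc;
        final int start;
        final int end;

        Token(String term, int posInc, int start, int end) {
            this.term = term;
            this.posInc = posInc;
            this.start = start;
            this.end = end;
        }
    }

    /**
     * Splits one input token into parts. Every part keeps the offsets of the
     * input token, and only the first part advances the position.
     */
    static List<Token> split(Token input, String... parts) {
        List<Token> out = new ArrayList<>();
        for (int i = 0; i < parts.length; i++) {
            int posInc = (i == 0) ? input.posInc : 0; // later parts share the position
            out.add(new Token(parts[i], posInc, input.start, input.end));
        }
        return out;
    }

    public static void main(String[] args) {
        Token in = new Token("abcdef", 1, 10, 16);
        for (Token t : split(in, "abc", "def")) {
            System.out.println(t.term + " posInc=" + t.posInc
                + " offsets=" + t.start + "-" + t.end);
        }
    }
}
```

Because the filter no longer creates new positions or rewrites offsets, it no longer does the two things the description names as the problem.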
          rcmuir Robert Muir added a comment -

          I think this should be a tokenizer.

          jpountz Adrien Grand added a comment -

          Good point, I'll update the patch to create a ThaiTokenizer so that we can just completely deprecate this filter.

          rcmuir Robert Muir added a comment -

          Tokenizing from a BreakIterator can get a little tricky.

          We had some support for this (it should be re-reviewed) in the initial kuromoji integration (SegmentingTokenizerBase.java and its test),
          but we ended up adding a streaming Viterbi search, so we didn't need it anymore:

          http://svn.apache.org/viewvc?view=revision&revision=1230748
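For readers unfamiliar with the technique, here is a minimal sketch of word segmentation with the JDK's `java.text.BreakIterator`. It is not SegmentingTokenizerBase or the patch code; `BreakIteratorSketch`, `wordSpans`, and `hasLetterOrDigit` are hypothetical names. The sketch takes the whole text as a `String`, which sidesteps one likely source of the trickiness mentioned above: a tokenizer reads from a stream, while a BreakIterator needs the text it segments in memory.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

class BreakIteratorSketch {
    /** Returns {start, end} offset pairs for each segment that contains a letter or digit. */
    static List<int[]> wordSpans(String text, Locale locale) {
        BreakIterator it = BreakIterator.getWordInstance(locale);
        it.setText(text);
        List<int[]> spans = new ArrayList<>();
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            // The word instance also reports whitespace and punctuation segments; skip those.
            if (hasLetterOrDigit(text, start, end)) {
                spans.add(new int[] {start, end});
            }
        }
        return spans;
    }

    static boolean hasLetterOrDigit(String text, int start, int end) {
        for (int i = start; i < end; ) {
            int cp = text.codePointAt(i);
            if (Character.isLetterOrDigit(cp)) {
                return true;
            }
            i += Character.charCount(cp);
        }
        return false;
    }

    public static void main(String[] args) {
        String text = "hello, world";
        for (int[] span : wordSpans(text, Locale.ROOT)) {
            System.out.println(text.substring(span[0], span[1])
                + " [" + span[0] + "," + span[1] + ")");
        }
    }
}
```

Each span's offsets come directly from the break positions, so a tokenizer built this way sets offsets once at the source instead of having a downstream filter adjust them.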

          rcmuir Robert Muir added a comment -

          I cut this over to ThaiTokenizer with that base class restored from Kuromoji. The tokenizer itself is simpler now. I think we can use the same approach with SmartChinese.

          rcmuir Robert Muir added a comment -

          Updated patch: I also cut over SmartChinese to use this same approach while we are here.

          rjernst Ryan Ernst added a comment -

          +1, patch lgtm

          Is fixing Smart Chinese to not emit punctuation as simple as hardcoding the list of punctuation characters and skipping them in something like incrementWord()?

          rcmuir Robert Muir added a comment -

          It's even simpler than that, but I wanted to do that in a follow-up issue. 4.8 is a good time to fix it, as it's easy with this tokenizer!
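The follow-up fix itself is not shown in this thread. As a purely illustrative sketch of the idea raised in the question, the plain-Java code below drops punctuation-only terms using Unicode general categories instead of a hardcoded character list. `PunctuationSkipSketch`, `isPunctuationOnly`, and `dropPunctuation` are hypothetical names, and this is an assumption about one possible approach, not the Smart Chinese implementation.

```java
import java.util.ArrayList;
import java.util.List;

class PunctuationSkipSketch {
    /** True if the code point belongs to one of the Unicode punctuation categories. */
    static boolean isPunctuation(int cp) {
        switch (Character.getType(cp)) {
            case Character.CONNECTOR_PUNCTUATION:
            case Character.DASH_PUNCTUATION:
            case Character.START_PUNCTUATION:
            case Character.END_PUNCTUATION:
            case Character.INITIAL_QUOTE_PUNCTUATION:
            case Character.FINAL_QUOTE_PUNCTUATION:
            case Character.OTHER_PUNCTUATION:
                return true;
            default:
                return false;
        }
    }

    /** True if the term is non-empty and consists only of punctuation code points. */
    static boolean isPunctuationOnly(String term) {
        if (term.isEmpty()) {
            return false;
        }
        for (int i = 0; i < term.length(); ) {
            int cp = term.codePointAt(i);
            if (!isPunctuation(cp)) {
                return false;
            }
            i += Character.charCount(cp);
        }
        return true;
    }

    /** Returns the terms with punctuation-only entries removed. */
    static List<String> dropPunctuation(List<String> terms) {
        List<String> kept = new ArrayList<>();
        for (String term : terms) {
            if (!isPunctuationOnly(term)) {
                kept.add(term);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<String> terms = new ArrayList<>();
        terms.add("\u6211");   // a CJK word
        terms.add("\uFF0C");   // fullwidth comma
        terms.add("\u662F");   // a CJK word
        terms.add("\u3002");   // ideographic full stop
        System.out.println(dropPunctuation(terms).size() + " terms kept");
    }
}
```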

          simonw Simon Willnauer added a comment -

          I really like the base class! The patch LGTM, +1 to commit.

          jira-bot ASF subversion and git services added a comment -

          Commit 1579846 from Robert Muir in branch 'dev/trunk'
          [ https://svn.apache.org/r1579846 ]

          LUCENE-4984: Fix ThaiWordFilter, smartcn WordTokenFilter

          jira-bot ASF subversion and git services added a comment -

          Commit 1579853 from Robert Muir in branch 'dev/trunk'
          [ https://svn.apache.org/r1579853 ]

          LUCENE-4984: actually pass down the AttributeFactory to superclass

          jira-bot ASF subversion and git services added a comment -

          Commit 1579855 from Robert Muir in branch 'dev/branches/branch_4x'
          [ https://svn.apache.org/r1579855 ]

          LUCENE-4984: Fix ThaiWordFilter, smartcn WordTokenFilter

          thetaphi Uwe Schindler added a comment -

          Close issue after release of 4.8.0


            People

            • Assignee: jpountz Adrien Grand
            • Reporter: jpountz Adrien Grand