Uploaded image for project: 'Lucene - Core'
  1. Lucene - Core
  2. LUCENE-2404

Improve speed of ThaiWordFilter by CharacterIterator, factor out LowerCasing and also fix some bugs (empty tokens stop iteration)

Details

    • Bug
    • Status: Closed
    • Major
    • Resolution: Fixed
    • None
    • 3.1, 4.0-ALPHA
    • modules/analysis
    • None
    • New, Patch Available

    Description

      The ThaiWordFilter creates new Strings out of term buffer before passing to The BreakIterator., But BreakIterator can take a CharacterIterator and directly process on it without buffer copying.
      As Java itsself does not provide a CharacterIterator implementation in java.text, we can use the javax.swing.text.Segment class, that operates on a char[] and is even reuseable! This class is very strange but it works and is in JDK 1.4+ and not deprecated.

      The filter also had a bug: It stopped iterating tokens when an empty token occurred. Also the lowercasing for non-thai words was removed and put into the Analyzer by adding LowerCaseFilter.

      Attachments

        1. LUCENE-2404.patch
          4 kB
          Uwe Schindler
        2. LUCENE-2404.patch
          5 kB
          Uwe Schindler
        3. LUCENE-2404-2.patch
          6 kB
          Uwe Schindler
        4. LUCENE-2404-2.patch
          7 kB
          Uwe Schindler

        Activity

          People

            uschindler Uwe Schindler
            uschindler Uwe Schindler
            Votes:
            0 Vote for this issue
            Watchers:
            0 Start watching this issue

            Dates

              Created:
              Updated:
              Resolved: