[LUCENE-2404] Improve speed of ThaiWordFilter by CharacterIterator, factor out LowerCasing and also fix some bugs (empty tokens stop iteration) - ASF JIRA

Details

Type: Bug
Status: Closed
Priority: Major
Resolution: Fixed
Affects Version/s: None
Fix Version/s: 3.1, 4.0-ALPHA
Component/s: modules/analysis
Labels: None
Lucene Fields: New, Patch Available

Description

ThaiWordFilter creates a new String from the term buffer before passing it to the BreakIterator. But BreakIterator can take a CharacterIterator and process the buffer directly, without copying.
java.text does not provide a CharacterIterator implementation that operates on a char[], so we can use the javax.swing.text.Segment class, which wraps a char[] and is even reusable. This class is rather strange, but it works, has been in the JDK since 1.4, and is not deprecated.
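
The idea can be sketched outside of Lucene roughly as follows. This is only a minimal illustration, not the code from the patch: the class name SegmentBreakDemo and the helper words(...) are hypothetical, and the char[] stands in for a token's term buffer. It relies only on BreakIterator.setText(CharacterIterator) and the public array, offset and count fields of Segment.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import javax.swing.text.Segment;

public class SegmentBreakDemo {

    // Hypothetical helper: splits the first `length` chars of `buffer` into words.
    // The Segment is re-pointed at the buffer instead of building a new String.
    static List<String> words(BreakIterator breaker, Segment segment,
                              char[] buffer, int length) {
        segment.array = buffer;   // no copy: the Segment only references the array
        segment.offset = 0;       // with offset 0, boundaries are plain array indices
        segment.count = length;
        breaker.setText(segment); // Segment implements CharacterIterator

        List<String> out = new ArrayList<>();
        int start = breaker.first();
        for (int end = breaker.next(); end != BreakIterator.DONE;
                start = end, end = breaker.next()) {
            // Skip pieces that start with whitespace or punctuation.
            if (Character.isLetterOrDigit(buffer[start])) {
                out.add(new String(buffer, start, end - start));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // A Thai-aware filter would use a Thai locale here; Locale.ROOT keeps
        // the demo independent of the JDK's Thai dictionary support.
        BreakIterator breaker = BreakIterator.getWordInstance(Locale.ROOT);
        Segment segment = new Segment(); // created once, reused for every buffer

        char[] first = "hello world".toCharArray();
        System.out.println(words(breaker, segment, first, first.length));

        char[] second = "foo bar baz".toCharArray();
        System.out.println(words(breaker, segment, second, 7)); // only "foo bar"
    }
}
```

Both the BreakIterator and the Segment are allocated once and reused, so the only per-call allocations left in this sketch are the result strings.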

The filter also had a bug: it stopped iterating over tokens when an empty token occurred. In addition, the lowercasing of non-Thai words was removed from the filter and moved into the Analyzer by adding a LowerCaseFilter.

Attachments

- LUCENE-2404.patch, 19/Apr/10 17:39, 4 kB, Uwe Schindler
- LUCENE-2404.patch, 19/Apr/10 18:00, 5 kB, Uwe Schindler
- LUCENE-2404-2.patch, 19/Apr/10 18:24, 6 kB, Uwe Schindler
- LUCENE-2404-2.patch, 19/Apr/10 20:33, 7 kB, Uwe Schindler

People

Assignee: Uwe Schindler
Reporter: Uwe Schindler
Votes: 0
Watchers: 0

Dates

Created: 19/Apr/10 17:38
Updated: 28/Aug/22 12:24
Resolved: 19/Apr/10 20:58