[LUCENE-3940] When Japanese (Kuromoji) tokenizer removes a punctuation token it should leave a hole - ASF JIRA

XML

Word

Printable

JSON

Details

Type: Bug
Status: Closed
Priority: Minor
Resolution: Fixed
Affects Version/s: None
Fix Version/s: 4.0-ALPHA, 3.6.1
Component/s: None
Labels:
None

Lucene Fields:

New

Description

I modified BaseTokenStreamTestCase to assert that the start/end
offsets match for graph (posLen > 1) tokens, and this caught a bug in
Kuromoji when the decompounding of a compound token has a punctuation
token that's dropped.

In this case we should leave hole(s) so that the graph is intact, ie,
the graph should look the same as if the punctuation tokens were not
initially removed, but then a StopFilter had removed them.

This also affects tokens that have no compound over them, ie we fail
to leave a hole today when we remove the punctuation tokens.

I'm not sure this is serious enough to warrant fixing in 3.6 at the
last minute...

Attachments

- Sort By Name
- Sort By Date
- Ascending
- Descending

LUCENE-3940.patch
31/Mar/12 00:19
56 kB
Michael McCandless
LUCENE-3940.patch
31/Mar/12 09:56
67 kB
Michael McCandless
LUCENE-3940.patch
01/Apr/12 14:01
70 kB
Michael McCandless
LUCENE-3940.patch
01/Apr/12 17:28
67 kB
Michael McCandless

Activity

People

Assignee:: Michael McCandless

Reporter:: Michael McCandless

Votes:: 0 Vote for this issue

Watchers:: 1 Start watching this issue

Dates

Created:: 31/Mar/12 00:06

Updated:: 28/Aug/22 13:13

Resolved:: 08/Apr/12 19:36