[LUCENE-9112] SegmentingTokenizerBase splits terms that occupy 1024th positions in text

Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 9.0
    • Fix Version/s: None
    • Component/s: modules/analysis
    • Lucene Fields: New

    Description

      The OpenNLP tokenizer shows weird behaviour when text contains spurious punctuation, such as triple dots trailing a sentence...

      1. the first dot becomes part of the token, so 'sentence.' is emitted as the token
      2. much further down the text, a seemingly unrelated token is suddenly split up; in my example (see the attached unit test) the name 'Baron' is split into 'Baro' and 'n'. This is the real problem.

      The problem never seems to occur with small texts in unit tests, but it certainly does in real-world examples. Depending on how many 'spurious' dots there are, a completely different term may be split, or the same term at a different location.

      I am not sure whether this is actually a problem in the Lucene code, but it is a problem, and I have a Lucene unit test proving it.
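
      A minimal sketch of this kind of reproduction (not the attached unit test itself) is below. It assumes the attached en-sent.bin and en-token.bin models are available in the working directory, that the analysis/opennlp module is on the classpath, and the exact amount of padding needed to push a term across the 1024-character buffer boundary may need adjusting.

          import java.nio.file.Paths;

          import org.apache.lucene.analysis.Analyzer;
          import org.apache.lucene.analysis.TokenStream;
          import org.apache.lucene.analysis.custom.CustomAnalyzer;
          import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

          public class OpenNlpSplitRepro {
            public static void main(String[] args) throws Exception {
              // Build an analyzer around the OpenNLP tokenizer, loading the sentence
              // and token models from the working directory.
              Analyzer analyzer = CustomAnalyzer.builder(Paths.get("."))
                  .withTokenizer("opennlp",
                      "sentenceModel", "en-sent.bin",
                      "tokenizerModel", "en-token.bin")
                  .build();

              // A sentence trailed by triple dots, padded so that a later term
              // ("Baron") lands near the 1024-character mark.
              StringBuilder text = new StringBuilder("This is a sentence...");
              while (text.length() < 1020) {
                text.append(" filler");
              }
              text.append(" Baron was mentioned much later in the text.");

              try (TokenStream ts = analyzer.tokenStream("body", text.toString())) {
                CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
                ts.reset();
                while (ts.incrementToken()) {
                  // Expect 'Baron' as a single token; the bug shows up as 'Baro' + 'n'.
                  System.out.println(term.toString());
                }
                ts.end();
              }
            }
          }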

      Attachments

        1. LUCENE-9112.patch
          6 kB
          Markus Jelsma
        2. LUCENE-9112.patch
          6 kB
          Markus Jelsma
        3. en-token.bin
          583 kB
          Markus Jelsma
        4. en-sent.bin
          35 kB
          Markus Jelsma
        5. LUCENE-9112-unittest.patch
          4 kB
          Markus Jelsma
        6. LUCENE-9112-unittest.patch
          4 kB
          Markus Jelsma


          People

            Assignee: Unassigned
            Reporter: Markus Jelsma (markus17)
            Votes: 0
            Watchers: 3
