[LUCENE-8548] Reevaluate scripts boundary break in Nori's tokenizer - ASF JIRA

Details

Type: Improvement
Status: Closed
Priority: Minor
Resolution: Fixed
Affects Version/s: None
Fix Version/s: 7.7, 8.0
Component/s: None
Labels: None
Lucene Fields: New, Patch Available

Description

This was first reported in https://issues.apache.org/jira/browse/LUCENE-8526:

Tokens are split on different character POS types (which seem to not quite line up with Unicode character blocks), which leads to weird results for non-CJK tokens:

εἰμί is tokenized as three tokens: ε/SL(Foreign language) + ἰ/SY(Other symbol) + μί/SL(Foreign language)
ka̠k̚t͡ɕ͈a̠k̚ is tokenized as ka/SL(Foreign language) + ̠/SY(Other symbol) + k/SL(Foreign language) + ̚/SY(Other symbol) + t/SL(Foreign language) + ͡ɕ͈/SY(Other symbol) + a/SL(Foreign language) + ̠/SY(Other symbol) + k/SL(Foreign language) + ̚/SY(Other symbol)
Ба̀лтичко̄ is tokenized as ба/SL(Foreign language) + ̀/SY(Other symbol) + лтичко/SL(Foreign language) + ̄/SY(Other symbol)
don't is tokenized as don + t; the same happens for don’t (with a curly apostrophe).
אוֹג׳וּ is tokenized as אוֹג/SY(Other symbol) + וּ/SY(Other symbol)
Мoscow (with a Cyrillic М and the rest in Latin) is tokenized as м + oscow
While it is still possible to find these words using Nori, there are many more chances for false positives when the tokens are split up like this. In particular, individual numbers and combining diacritics are indexed separately (e.g., in the Cyrillic example above), which can lead to a performance hit on large corpora like Wiktionary or Wikipedia.
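
The character properties behind these splits can be inspected with plain JDK calls. The sketch below does not use Nori and does not reproduce how Nori assigns its character classes; it only shows that the Greek example spans two Unicode blocks while staying in one script, and that the combining diacritics are non-spacing marks. The class and method names are illustrative assumptions.

```java
class BlockVersusScript {
    /** True when both code points fall in the same Unicode block. */
    static boolean sameBlock(int a, int b) {
        return Character.UnicodeBlock.of(a) == Character.UnicodeBlock.of(b);
    }

    /** True when both code points belong to the same Unicode script. */
    static boolean sameScript(int a, int b) {
        return Character.UnicodeScript.of(a) == Character.UnicodeScript.of(b);
    }

    /** True for non-spacing, spacing-combining, and enclosing marks. */
    static boolean isCombiningMark(int cp) {
        int type = Character.getType(cp);
        return type == Character.NON_SPACING_MARK
            || type == Character.COMBINING_SPACING_MARK
            || type == Character.ENCLOSING_MARK;
    }

    /** Lists block, script, and mark status for every code point in the text. */
    static String describe(String text) {
        StringBuilder out = new StringBuilder();
        text.codePoints().forEach(cp -> out.append(String.format(
            "U+%04X block=%s script=%s mark=%b%n",
            cp,
            Character.UnicodeBlock.of(cp),
            Character.UnicodeScript.of(cp),
            isCombiningMark(cp))));
        return out.toString();
    }

    public static void main(String[] args) {
        // The Greek example: epsilon, iota with psili, mu, iota with tonos.
        System.out.print(describe("\u03B5\u1F30\u03BC\u03AF"));
        // Cyrillic a followed by a combining grave accent.
        System.out.print(describe("\u0430\u0300"));
    }
}
```

Comparing by script rather than by block keeps ε and ἰ together, while the mixed Cyrillic and Latin word still shows a script change between М and o.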

Workaround: use a character filter to strip combining diacritics before Nori processes the text. This doesn't solve the Greek, Hebrew, or English cases, though.
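
A minimal sketch of what such a filter would do to the text, using only the JDK regex engine. The helper name `strip` is hypothetical, and in a real analysis chain the same pattern would have to be applied by a character filter placed before the tokenizer (for example a pattern-based one) rather than by a standalone method. Removing `\p{Mn}` characters repairs the Cyrillic and IPA examples but leaves the precomposed Greek letter and the apostrophe untouched, which matches the limitation noted above.

```java
class StripCombiningMarks {
    /** Removes every non-spacing combining mark (Unicode category Mn). */
    static String strip(String text) {
        return text.replaceAll("\\p{Mn}+", "");
    }

    public static void main(String[] args) {
        // Cyrillic word with combining grave accent and combining macron.
        System.out.println(strip("\u0411\u0430\u0300\u043B\u0442\u0438\u0447\u043A\u043E\u0304"));
        // IPA transcription with several combining marks.
        System.out.println(strip("ka\u0320k\u031At\u0361\u0255\u0348a\u0320k\u031A"));
        // Precomposed Greek letters contain no separate marks, so nothing changes.
        System.out.println(strip("\u03B5\u1F30\u03BC\u03AF"));
    }
}
```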

Suggested fix:

Characters in related Unicode blocks, such as "Greek" and "Greek Extended" or "Latin" and "IPA Extensions", should not trigger token splits.
Combining diacritics should not trigger token splits.
Non-CJK text should be tokenized on spaces and punctuation, not on character type shifts.
Apostrophe-like characters should not trigger token splits (though I could see someone disagreeing on this one).
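
The suggested rules can be sketched as a tiny standalone tokenizer for non-CJK text. This is an illustration of the proposal only, not the attached patch and not Nori's implementation: the class name, the method names, and the choice of which characters count as apostrophe-like are all assumptions, and Korean text would still need the dictionary-based segmentation.

```java
import java.util.ArrayList;
import java.util.List;

class SuggestedBoundaries {
    /** Assumption: only the ASCII apostrophe and U+2019 are apostrophe-like. */
    static boolean isApostropheLike(int cp) {
        return cp == '\'' || cp == 0x2019;
    }

    /** True for non-spacing, spacing-combining, and enclosing marks. */
    static boolean isCombiningMark(int cp) {
        int type = Character.getType(cp);
        return type == Character.NON_SPACING_MARK
            || type == Character.COMBINING_SPACING_MARK
            || type == Character.ENCLOSING_MARK;
    }

    /** Letters, digits, combining marks, and apostrophes never end a token. */
    static boolean isTokenChar(int cp) {
        return Character.isLetterOrDigit(cp) || isCombiningMark(cp) || isApostropheLike(cp);
    }

    /** Splits only on spaces and punctuation, never on a script or block change. */
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            i += Character.charCount(cp);
            if (isTokenChar(cp)) {
                current.appendCodePoint(cp);
            } else if (current.length() > 0) {
                tokens.add(current.toString());
                current.setLength(0);
            }
        }
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Greek word, English contraction, and the mixed Cyrillic/Latin word.
        System.out.println(tokenize("\u03B5\u1F30\u03BC\u03AF don't \u041Coscow"));
    }
}
```

A side effect of this simplification is that quotation-style apostrophes at the edge of a word stay attached to the token, which is one reason the apostrophe rule is open to disagreement.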

Attachments

LUCENE-8548.patch (6 kB), 27/Nov/18 19:06, Jim Ferenczi
screenshot-1.png (33 kB), 23/Nov/18 16:30, Christophe Bismuth
testCyrillicWord.dot.png (18 kB), 21/Nov/18 15:03, Christophe Bismuth

Issue Links

links to

GitHub Pull Request #505

People

Assignee: Unassigned

Reporter: Jim Ferenczi

Votes: 1

Watchers: 5

Dates

Created: 26/Oct/18 08:15

Updated: 28/Aug/22 15:37

Resolved: 03/Dec/18 10:15

Time Tracking

Estimated: Not Specified

Remaining:

Logged: 0.5h