[LUCENE-8450] Enable TokenFilters to assign offsets when splitting tokens - ASF JIRA

Details

Type: Improvement
Status: Open
Priority: Major
Resolution: Unresolved
Affects Version/s: None
Fix Version/s: None
Component/s: None
Labels: None
Lucene Fields: New, Patch Available

Description

CharFilters and TokenFilters may alter token lengths, meaning that subsequent filters cannot perform simple arithmetic to calculate the original ("correct") offset of a character in the interior of the token. A similar situation exists for Tokenizers, but these can call CharFilter.correctOffset() to map offsets back to their original location in the input stream. There is no such API for TokenFilters.

This issue calls for adding an API to support use cases like highlighting the correct portion of a compound token. For example, the German word "außerstand" (meaning, as far as I can tell, "unable to do something") will be decompounded and match "stand" and "ausser", but as things are today, offsets are always set using the start and end of the tokens produced by the Tokenizer, so highlighters will highlight the entire compound.
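To make the problem concrete, here is a minimal sketch in plain Java (no Lucene classes) of why simple arithmetic on the filtered term goes wrong. The class and method names are hypothetical, and `String.replace` only stands in for whatever length-changing filter ran earlier in the chain; it is not how Lucene folds characters.

```java
// Hypothetical demo: not part of Lucene and not taken from offsets.patch.
final class NaiveOffsetDemo {

    // Stand-in for a length-changing filter: folds "ß" into "ss".
    static String fold(String s) {
        return s.replace("\u00df", "ss");
    }

    // Locates `part` inside the folded term, then applies those positions
    // directly to the original text, as a filter with no offset-correction
    // API would have to do.
    static String naiveHighlight(String original, String part) {
        String folded = fold(original);
        int start = folded.indexOf(part);
        if (start < 0) {
            return "";
        }
        // Clamp the end so the substring call stays within the original text.
        int end = Math.min(start + part.length(), original.length());
        return original.substring(Math.min(start, end), end);
    }

    public static void main(String[] args) {
        String original = "au\u00dferstand"; // 10 chars, "stand" starts at 5
        // The folded term is "ausserstand" (11 chars), where "stand" starts at 6,
        // so the naive highlight is shifted by one character.
        System.out.println(naiveHighlight(original, "stand")); // prints "tand"
    }
}
```

The one-character shift comes entirely from the "ß" to "ss" expansion, which is the kind of information a TokenFilter currently has no way to query.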

I'm proposing to add this method to `TokenStream`:

referencing a CharOffsetMap with these methods:

int correctOffset(int currentOff);
int uncorrectOffset(int originalOff);

The uncorrectOffset method is a pseudo-inverse of correctOffset, mapping from original offset forward to the current "offset space".
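As an illustration of the two methods, here is a rough sketch of one way such a map could behave. Only the `correctOffset` and `uncorrectOffset` signatures come from this issue; the class name `SimpleCharOffsetMap`, the `addCorrection` helper, and the choice of "smallest current offset that maps to at least the requested original offset" as the pseudo-inverse are all assumptions made for the example, not the contents of offsets.patch.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: not a Lucene class and not the attached patch.
final class SimpleCharOffsetMap {

    // Each entry is {currentOff, cumulativeDiff}: from currentOff onward,
    // original offset = current offset + cumulativeDiff.
    private final List<int[]> corrections = new ArrayList<>();

    // Assumed helper; entries must be added in increasing currentOff order.
    void addCorrection(int currentOff, int cumulativeDiff) {
        corrections.add(new int[] {currentOff, cumulativeDiff});
    }

    // Maps an offset in the current (filtered) text back to the original input.
    int correctOffset(int currentOff) {
        int diff = 0;
        for (int[] c : corrections) {
            if (c[0] > currentOff) {
                break;
            }
            diff = c[1];
        }
        return currentOff + diff;
    }

    // Pseudo-inverse (assumed semantics): the smallest current offset whose
    // corrected value is at least originalOff. A linear scan keeps the sketch short.
    int uncorrectOffset(int originalOff) {
        int current = 0;
        while (correctOffset(current) < originalOff) {
            current++;
        }
        return current;
    }

    public static void main(String[] args) {
        // "außerstand" -> "ausserstand": one extra char from current offset 3 on.
        SimpleCharOffsetMap map = new SimpleCharOffsetMap();
        map.addCorrection(3, -1);
        System.out.println(map.correctOffset(6));   // prints 5 (start of "stand")
        System.out.println(map.correctOffset(11));  // prints 10 (end of the word)
        System.out.println(map.uncorrectOffset(5)); // prints 6
    }
}
```

With a map like this, a decompounding filter could report "stand" at original offsets 5 to 10 rather than inheriting the offsets of the whole compound. It is only a pseudo-inverse because both current offsets 2 and 3 correct to original offset 2, so `uncorrectOffset(2)` has to pick one of them (here, 2).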

Attachments


offsets.patch (27 kB), 08/Aug/18 22:03, Michael Sokolov

People

Assignee: Unassigned
Reporter: Michael Sokolov
Votes: 1
Watchers: 3

Dates

Created: 09/Aug/18 00:28
Updated: 28/Aug/22 15:34