[LUCENE-6595] CharFilter offsets correction is wonky - ASF JIRA

Details

Type: Bug
Status: Open
Priority: Major
Resolution: Unresolved
Affects Version/s: None
Fix Version/s: None
Component/s: None
Labels: None
Lucene Fields: New

Description

Spinoff from this original Elasticsearch issue: https://github.com/elastic/elasticsearch/issues/11726

If I make a MappingCharFilter with these mappings:

  ( -> 
  ) ->

i.e., just erase left and right paren, then tokenizing the string
"(F31)" with e.g. WhitespaceTokenizer, produces a single token F31,
with start offset 1 (good).

But for its end offset, I would expect/want 4, but it produces 5
today.

This can be easily explained given how the mapping works: each time a
mapping rule matches, we update the cumulative offset difference,
conceptually as an array like this (it's encoded more compactly):

  Output offset: 0 1 2 3
   Input offset: 1 2 3 5

When the tokenizer produces F31, it assigns it startOffset=0 and
endOffset=3 based on the characters it sees (F, 3, 1). It then asks
the CharFilter to correct those offsets, mapping them backwards
through the above arrays, which creates startOffset=1 (good) and
endOffset=5 (bad).
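
The lookup described above can be sketched in plain Java. This is a simplified model under stated assumptions, not Lucene's actual code: the class name, the method signature, and the two-array encoding (an output offset paired with the cumulative difference that applies from that offset onward) are hypothetical, chosen only so that the numbers reproduce the table for "(F31)".

```java
public class OffsetCorrectionSketch {
    // Hypothetical compact encoding: from output offset offsets[i] onward,
    // add diffs[i] (the cumulative offset difference) to get the input offset.
    static int correct(int[] offsets, int[] diffs, int outputOffset) {
        int diff = 0;
        for (int i = 0; i < offsets.length; i++) {
            if (offsets[i] > outputOffset) {
                break; // later entries only apply to later output offsets
            }
            diff = diffs[i];
        }
        return outputOffset + diff;
    }

    public static void main(String[] args) {
        // "(F31)" with '(' and ')' erased: one character is dropped before
        // output offset 0, and a second one before output offset 3.
        int[] offsets = {0, 3};
        int[] diffs = {1, 2};
        System.out.println("startOffset=" + correct(offsets, diffs, 0)); // 1
        System.out.println("endOffset=" + correct(offsets, diffs, 3));   // 5
    }
}
```

The end offset lands on 5 because the erased ")" sits immediately before output offset 3, so its removal is already counted in the cumulative difference at that position.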

At first, I thought this was an "off-by-1" error, and that when
correcting the endOffset we should really return
1+correct(outputEndOffset-1), which would return the correct value (4)
here.

But that's too naive, e.g. here's another example:

  cccc -> cc

If I then tokenize cccc, today we produce the correct offsets (0, 4)
but if we do this "off-by-1" fix for endOffset, we would get the wrong
endOffset (2).
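
To make the two cases easy to compare, here is a small plain-Java sketch that applies both the current end-offset correction and the "off-by-1" variant. It uses the expanded array form (index = output offset, value = input offset). The array for "(F31)" is the one given above; the array for cccc -> cc is an assumption, reconstructed from the results stated in this description. The class and method names are hypothetical.

```java
public class EndOffsetFixSketch {
    // Current behavior: map the output end offset straight through the table.
    static int currentEnd(int[] inputOffsets, int outputEnd) {
        return inputOffsets[outputEnd];
    }

    // "Off-by-1" variant: correct the last character's offset, then add 1.
    static int naiveEnd(int[] inputOffsets, int outputEnd) {
        return 1 + inputOffsets[outputEnd - 1];
    }

    public static void main(String[] args) {
        int[] parens = {1, 2, 3, 5}; // "(F31)" -> "F31", token end = 3
        int[] cccc = {0, 1, 4};      // "cccc" -> "cc", token end = 2 (assumed table)

        System.out.println(currentEnd(parens, 3)); // 5 (want 4)
        System.out.println(naiveEnd(parens, 3));   // 4
        System.out.println(currentEnd(cccc, 2));   // 4
        System.out.println(naiveEnd(cccc, 2));     // 2 (want 4)
    }
}
```

Neither rule gives the desired end offset in both cases, which is the dilemma described here.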

I'm not sure what to do here...

Attachments

LUCENE-6595.patch, 14/Jul/15 11:28, 24 kB, Cao Manh Dat
LUCENE-6595.patch, 08/Jul/15 14:41, 12 kB, Cao Manh Dat
LUCENE-6595.patch, 23/Jun/15 16:38, 12 kB, Cao Manh Dat
LUCENE-6595.patch, 21/Jun/15 16:14, 11 kB, Cao Manh Dat
Lucene-6595.pptx, 09/Jul/15 03:39, 73 kB, Cao Manh Dat

Issue Links

is related to: LUCENE-5734 HTMLStripCharFilter end offset should be left of closing tags (Open)

People

Assignee: Unassigned

Reporter: Michael McCandless

Votes: 2

Watchers: 7

Dates

Created: 20/Jun/15 10:11

Updated: 28/Aug/22 14:36