Lucene - Core / LUCENE-9754

ICU Tokenizer: letter-space-number-letter tokenized inconsistently

Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 7.5
    • Fix Version/s: None
    • Component/s: core/search
    • Labels: None
    • Environment: Tested most recently on Elasticsearch 6.5.4.
    • Lucene Fields: New

    Description

      The ICU tokenizer's handling of strings like 14th depends on the character that comes before the preceding whitespace.

      For example, x 14th is tokenized as x | 14th; ァ 14th is tokenized as ァ | 14 | th.

      In general, in a letter-space-number-letter sequence, if the writing system before the space is the same as the writing system after the number, then you get two tokens. If the writing systems differ, you get three tokens.
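
      The rule above can be written down as a small predictor. This is only a sketch of the reported behavior, not the ICU tokenizer itself: the function names are hypothetical, and the script of a character is approximated by the first word of its Unicode name, which is a rough heuristic rather than the real Unicode Script property.

```python
import re
import unicodedata


def script_of(ch):
    """Rough script label: first word of the Unicode character name (heuristic)."""
    return unicodedata.name(ch, "UNKNOWN").split()[0]


def predicted_tokens(text):
    """Predict tokens for a letter-space-number-letter string per the reported rule.

    Assumption: the input has the shape <letters><whitespace><digits><letters>.
    """
    # [^\W\d_] matches Unicode letters only (word characters minus digits and "_").
    m = re.fullmatch(r"([^\W\d_]+)\s+(\d+)([^\W\d_]+)", text)
    if m is None:
        raise ValueError("expected letters, whitespace, digits, letters")
    before, number, after = m.groups()
    if script_of(before[-1]) == script_of(after[0]):
        # Same writing system on both sides: the number stays attached.
        return [before, number + after]
    # Different writing systems: the number is split from the trailing letters.
    return [before, number, after]


print(predicted_tokens("x 14th"))   # same script on both sides
print(predicted_tokens("ァ 14th"))  # Katakana before, Latin after
```

      The two calls correspond to the two examples in the report. Any other input only shows what the stated rule predicts, not output verified against the ICU tokenizer.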

      If the conditions are just right, the chunking that the ICU tokenizer does (trying to split on spaces to create <4k chunks) can create an artificial boundary between the tokens (e.g., between ァ and 14th) and prevent the unexpected split of the second token (14th). Because chunking changes can ripple through a long document, editing the text, or the effects of a character filter, can change tokenization thousands of lines later in the document. (I included this inconsistency as a side issue that I thought might add weight to the main problem I am concerned with, but it seems to be more of a distraction. Chunking issues should perhaps be addressed in a different ticket, so I'm striking it out.)

      My guess is that some "previous character set" flag is not reset at the space, and numbers are not in a character set, so t is compared to ァ and they are not the same, which causes a token split at the character set change. I'm not sure, though.
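
      The guess above can be illustrated with a toy model. This is not the ICU or Lucene implementation: the tokenize function and its reset_at_space parameter are hypothetical, and scripts are again approximated from Unicode character names. The sketch only shows that a "previous script" value that survives whitespace and is skipped by digits is enough to reproduce the reported split.

```python
import unicodedata


def script_of(ch):
    """Rough script label: first word of the Unicode character name (heuristic)."""
    return unicodedata.name(ch, "UNKNOWN").split()[0]


def tokenize(text, reset_at_space):
    """Toy tokenizer: break on whitespace and on a change of script.

    reset_at_space=False models the suspected bug (the previous script
    survives the space); reset_at_space=True models the expected behavior.
    """
    tokens, current = [], ""
    prev_script = None
    for ch in text:
        if ch.isspace():
            if current:
                tokens.append(current)
                current = ""
            if reset_at_space:
                prev_script = None
            continue
        if ch.isdigit():
            # Digits carry no script of their own, so prev_script is left alone.
            current += ch
            continue
        script = script_of(ch)
        if current and prev_script is not None and script != prev_script:
            # Script change inside a token: force a split here.
            tokens.append(current)
            current = ""
        current += ch
        prev_script = script
    if current:
        tokens.append(current)
    return tokens


print(tokenize("ァ 14th", reset_at_space=False))
print(tokenize("ァ 14th", reset_at_space=True))
```

      With reset_at_space=False the model compares t against the script of ァ and splits 14th, matching the report. With reset_at_space=True the number and its suffix stay together.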

       

      Attachments

        1. LUCENE-9754_prototype.patch
          11 kB
          Robert Muir


          People

            Assignee: Unassigned
            Reporter: Trey Jones
            Votes: 0
            Watchers: 2
