Lucene - Core / LUCENE-7181

JapaneseTokenizer: Validate segmentation of User Dictionary entries on creation

Details

    • Type: Improvement
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: None
    • Labels: None
    • Lucene Fields: New

Description

    From the conversation on the dev list:

    The user dictionary in the JapaneseTokenizer allows users to customize how a stream is broken into tokens, using rules of the form:

    AABBBCC -> AA BBB CC
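
    For concreteness, here is a minimal sketch of how such a rule is exercised, assuming the Lucene 6-era kuromoji API in which UserDictionary still exposes a public Reader constructor (newer releases use UserDictionary.open(Reader) instead). The entry splits the compound 関西国際空港 the same way the AABBBCC -> AA BBB CC rule above does:

        import java.io.StringReader;
        import org.apache.lucene.analysis.ja.JapaneseTokenizer;
        import org.apache.lucene.analysis.ja.dict.UserDictionary;
        import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

        public class UserDictDemo {
          public static void main(String[] args) throws Exception {
            // One rule per line: surface,segmentation,readings,part-of-speech.
            String rules =
                "関西国際空港,関西 国際 空港,カンサイ コクサイ クウコウ,カスタム名詞\n";
            UserDictionary dict = new UserDictionary(new StringReader(rules));

            JapaneseTokenizer tok =
                new JapaneseTokenizer(dict, true, JapaneseTokenizer.Mode.NORMAL);
            tok.setReader(new StringReader("関西国際空港に行った"));
            CharTermAttribute term = tok.addAttribute(CharTermAttribute.class);
            tok.reset();
            while (tok.incrementToken()) {
              System.out.println(term.toString()); // 関西 / 国際 / 空港 / に / 行っ / た
            }
            tok.end();
            tok.close();
          }
        }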

    It does not, however, allow users to change any of the token characters, for example:

    (1) AABBBCC -> DD BBB CC (this silently tokenizes to "AA", "BBB", "CC"; only the split positions are honored)
    It also doesn't let a character be part of more than one token, for example:

    (2) AABBBCC -> AAB BBB BCC (this throws an ArrayIndexOutOfBoundsException)

    ...or make the output tokens longer than the input text:

    (3) AA -> AAA (also an ArrayIndexOutOfBoundsException)

    Currently there is no validation for these cases: case 1 does not fail but produces unexpected tokens, while cases 2 and 3 fail only later, when the input text is analyzed. We should add validation at UserDictionary creation time; a sketch of one possible check follows.
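
    The sketch below is illustrative, not taken from the attached patch; validateEntry and its message are hypothetical names. All three cases make the segmentation's concatenated characters differ from the surface form, so a single equality test rejects them up front:

        // Hypothetical helper; names and message are illustrative only.
        // surface is the raw entry text, segmentation its space-separated split.
        static void validateEntry(String surface, String segmentation) {
          String concatenated = segmentation.replace(" ", "");
          // Case 1 changes characters, case 2 duplicates the overlapped ones,
          // and case 3 emits more characters than the input contains; in every
          // case the concatenation no longer equals the surface form.
          if (!concatenated.equals(surface)) {
            throw new IllegalArgumentException(
                "Invalid user dictionary entry: segmentation \"" + segmentation
                    + "\" does not match surface form \"" + surface + "\"");
          }
        }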

Attachments

    1. LUCENE-7181.patch (3 kB, Tomas Eduardo Fernandez Lobbe)

Activity

People

    Assignee: Christian Moen
    Reporter: Tomas Eduardo Fernandez Lobbe
    Votes: 0
    Watchers: 1

Dates

    Created:
    Updated: