Lucene - Core / LUCENE-8125

emoji sequence support in ICUTokenizer

    Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: trunk, 7.4
    • Component/s: None
    • Labels: None
    • Lucene Fields: New

      Description

      The UAX #29 word break rules already know how to handle emoji sequences correctly; we just need to assign them a token type.

      This is better than having users try to do this with custom rules (e.g. LUCENE-7916), because emoji sequences are script-independent (their characters fall in the Common/Inherited scripts).
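To illustrate the script-independence point, here is a minimal sketch that uses only the JDK (not Lucene or ICU) to list the Unicode scripts found in an emoji ZWJ sequence. The class name `EmojiScripts` and the helper `scriptsOf` are hypothetical names chosen for this example; the result depends on the Unicode version bundled with the JDK in use.

```java
import java.lang.Character.UnicodeScript;
import java.util.EnumSet;
import java.util.Set;

// Hypothetical helper for illustration; not part of Lucene or the attached patch.
public class EmojiScripts {

    // Collect the Unicode script of every code point in the string.
    static Set<UnicodeScript> scriptsOf(String s) {
        Set<UnicodeScript> scripts = EnumSet.noneOf(UnicodeScript.class);
        s.codePoints().forEach(cp -> scripts.add(UnicodeScript.of(cp)));
        return scripts;
    }

    public static void main(String[] args) {
        // MAN (U+1F468), ZWJ (U+200D), WOMAN (U+1F469), ZWJ (U+200D), GIRL (U+1F467)
        String family = "\uD83D\uDC68\u200D\uD83D\uDC69\u200D\uD83D\uDC67";
        // The emoji code points are expected to report COMMON and the joiners INHERITED,
        // so no single script's rule set would naturally own this sequence.
        System.out.println(scriptsOf(family));
    }
}
```

Because none of these code points belongs to a specific script such as Latin or Han, a rule file written for one script is not an obvious home for them, which is the motivation for handling them in the tokenizer itself.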

        Attachments

        1. LUCENE-8125.patch
          27 kB
          Robert Muir
        2. LUCENE-8125.patch
          19 kB
          Robert Muir
        3. LUCENE-8125.patch
          20 kB
          Robert Muir
        4. LUCENE-8125.patch
          19 kB
          Robert Muir
        5. LUCENE-8125.patch
          17 kB
          Robert Muir

              People

              • Assignee: Unassigned
              • Reporter: Robert Muir (rcmuir)
              • Votes: 0
              • Watchers: 6
