Lucene - Core / LUCENE-2747

Deprecate/remove language-specific tokenizers in favor of StandardTokenizer

    Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 3.1, 4.0-ALPHA
    • Fix Version/s: 3.1, 4.0-ALPHA
    • Component/s: modules/analysis
    • Labels: None
    • Lucene Fields: New

      Description

      As of Lucene 3.1, StandardTokenizer implements the UAX#29 word boundary rules to provide language-neutral tokenization. Lucene contains several language-specific tokenizers whose work the UAX#29-based StandardTokenizer now covers; those tokenizers should be deprecated in 3.1 and removed in 4.0. The language-specific analyzers, by contrast, should remain, because they contain language-specific post-tokenization filters. In 3.1, these analyzers should switch to StandardTokenizer, as sketched below.
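
      A minimal sketch of that analyzer change, assuming Lucene 3.1's ReusableAnalyzerBase API and using LowerCaseFilter as a stand-in for a language's real post-tokenization filters (the analyzer name is hypothetical, not part of this patch):

      {code:java}
      import java.io.Reader;

      import org.apache.lucene.analysis.LowerCaseFilter;
      import org.apache.lucene.analysis.ReusableAnalyzerBase;
      import org.apache.lucene.analysis.TokenStream;
      import org.apache.lucene.analysis.Tokenizer;
      import org.apache.lucene.analysis.standard.StandardTokenizer;
      import org.apache.lucene.util.Version;

      // Hypothetical language analyzer: the language-specific tokenizer is swapped
      // for StandardTokenizer, while the language-specific filter chain is kept.
      public final class ExampleLanguageAnalyzer extends ReusableAnalyzerBase {
        private final Version matchVersion = Version.LUCENE_31;

        @Override
        protected TokenStreamComponents createComponents(String fieldName, Reader reader) {
          // was: a language-specific tokenizer such as ArabicLetterTokenizer
          Tokenizer source = new StandardTokenizer(matchVersion, reader);
          // language-specific post-tokenization filters remain; LowerCaseFilter stands in here
          TokenStream result = new LowerCaseFilter(matchVersion, source);
          return new TokenStreamComponents(source, result);
        }
      }
      {code}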

      Some usages of language-specific tokenizers will need additional work beyond just replacing the tokenizer in the language-specific analyzer.

      For example, PersianAnalyzer currently uses ArabicLetterTokenizer, and depends on the fact that this tokenizer breaks tokens on the ZWNJ character (zero-width non-joiner; U+200C), but in the UAX#29 word boundary rules, ZWNJ is not a word boundary. Robert Muir has suggested using a char filter converting ZWNJ to spaces prior to StandardTokenizer in the converted PersianAnalyzer.
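
      That suggestion could be sketched roughly as follows, using MappingCharFilter from the 3.x analysis API to map ZWNJ to a space in front of StandardTokenizer (illustrative only; the final PersianAnalyzer change may use a dedicated char filter instead):

      {code:java}
      import java.io.Reader;

      import org.apache.lucene.analysis.CharReader;
      import org.apache.lucene.analysis.MappingCharFilter;
      import org.apache.lucene.analysis.NormalizeCharMap;
      import org.apache.lucene.analysis.Tokenizer;
      import org.apache.lucene.analysis.standard.StandardTokenizer;
      import org.apache.lucene.util.Version;

      public final class ZwnjMappingSketch {
        // Wraps the incoming Reader so that ZWNJ (U+200C) is replaced by a space,
        // restoring the token breaks that ArabicLetterTokenizer used to produce.
        static Reader zwnjToSpace(Reader reader) {
          NormalizeCharMap map = new NormalizeCharMap();
          map.add("\u200C", " ");
          return new MappingCharFilter(map, CharReader.get(reader));
        }

        // Usage inside the converted analyzer (sketch):
        static Tokenizer tokenizer(Reader reader) {
          return new StandardTokenizer(Version.LUCENE_31, zwnjToSpace(reader));
        }
      }
      {code}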

    Attachments

    1. LUCENE-2747.patch (15 kB) - Robert Muir
    2. LUCENE-2747.patch (15 kB) - Robert Muir

    People

    • Assignee: Robert Muir (rcmuir)
    • Reporter: Steve Rowe (steve_rowe)
    • Votes: 0
    • Watchers: 0

    Dates

    • Created:
    • Updated:
    • Resolved: