  1. Lucene - Core
  2. LUCENE-2763

Swap URL+Email recognizing StandardTokenizer and UAX29Tokenizer

Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 3.1, 4.0-ALPHA
    • Fix Version/s: 3.1, 4.0-ALPHA
    • Component/s: modules/analysis
    • Labels: None

    Description

      Currently, in addition to implementing the UAX#29 word boundary rules, StandardTokenizer recognizes email addresses and URLs, but it provides no way to turn this behavior off or to emit overlapping tokens for the components (username from an email address, hostname from a URL, etc.).

      UAX29Tokenizer should become StandardTokenizer, and the current StandardTokenizer should be renamed to something like UAX29TokenizerPlusPlus.

      For rationale, see the discussion at the reopened LUCENE-2167.
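
      To make the difference between the two behaviors concrete, here is a minimal, self-contained sketch. It does not use Lucene. The class name TokenizerSketch, the methods wordTokens and urlEmailTokens, and all three regular expressions are hypothetical stand-ins chosen for illustration; they only roughly approximate UAX#29 word boundaries and URL/email recognition and are not the grammars from the attached patch.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration only: these regexes are crude stand-ins,
// not Lucene's tokenizer grammars and not a full UAX#29 implementation.
class TokenizerSketch {
    // Runs of letters/digits, optionally joined by '.' or '\'' between them
    // (a rough approximation of the UAX#29 mid-letter/mid-number rules).
    private static final String WORD =
        "[\\p{L}\\p{N}]+(?:[.'][\\p{L}\\p{N}]+)*";
    // Simplified email shape: local-part@domain.tld
    private static final String EMAIL =
        "[\\p{L}\\p{N}._%+-]+@[\\p{L}\\p{N}-]+(?:\\.[\\p{L}\\p{N}-]+)+";
    // Simplified URL shape: http(s):// followed by non-space characters
    private static final String URL = "https?://[^\\s<>\"]+";

    private static final Pattern WORDS_ONLY = Pattern.compile(WORD);
    // URL and EMAIL come first so they win over WORD at the same position.
    private static final Pattern WITH_URL_EMAIL =
        Pattern.compile(URL + "|" + EMAIL + "|" + WORD);

    // Word-boundary-only behavior (what the issue wants StandardTokenizer to be).
    static List<String> wordTokens(String text) {
        return collect(WORDS_ONLY, text);
    }

    // Word boundaries plus URL/email recognition (the behavior to be renamed).
    static List<String> urlEmailTokens(String text) {
        return collect(WITH_URL_EMAIL, text);
    }

    private static List<String> collect(Pattern pattern, String text) {
        List<String> tokens = new ArrayList<>();
        Matcher m = pattern.matcher(text);
        while (m.find()) {
            tokens.add(m.group());
        }
        return tokens;
    }

    public static void main(String[] args) {
        String text = "Mail bob@example.com or see http://lucene.apache.org/core";
        System.out.println(wordTokens(text));
        System.out.println(urlEmailTokens(text));
    }
}
```

      In this sketch the word-only variant breaks the email address and URL into their components, while the second variant keeps each as a single token. Neither variant emits both forms at once, which is the limitation the description points out.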

      Attachments

        1. LUCENE-2763.patch
          373 kB
          Steven Rowe
        2. LUCENE-2763.patch
          369 kB
          Steven Rowe
        3. LUCENE-2763.patch
          367 kB
          Steven Rowe

            People

              Assignee: Steven Rowe (sarowe)
              Reporter: Steven Rowe (sarowe)
              Votes: 0
              Watchers: 0
