  1. Lucene - Core
  2. LUCENE-2763

Swap URL+Email recognizing StandardTokenizer and UAX29Tokenizer

    Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 3.1, 4.0-ALPHA
    • Fix Version/s: 3.1, 4.0-ALPHA
    • Component/s: modules/analysis
    • Labels:
      None

      Description

      Currently, in addition to implementing the UAX#29 word boundary rules, StandardTokenizer recognizes email addresses and URLs, but it provides no way to turn this behavior off or to emit overlapping tokens for their components (username from an email address, hostname from a URL, etc.).
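
To make the difference concrete, here is a rough, self-contained sketch of the two behaviors. It does not use Lucene and is not the UAX#29 grammar: the class `TokenizerSketch`, the method `tokenize`, the flag `keepUrlsAndEmails`, and the three regular expressions are all hypothetical simplifications chosen only for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration only; these patterns are crude stand-ins,
// not the UAX#29 word boundary rules and not Lucene's tokenizer grammar.
class TokenizerSketch {
    // A "word": letters/digits, optionally joined by an internal '.' or '\''.
    private static final String WORD = "[\\p{L}\\p{N}]+(?:[.'][\\p{L}\\p{N}]+)*";
    // Deliberately simplistic email and URL shapes.
    private static final String EMAIL = "[\\p{L}\\p{N}._%+-]+@[\\p{L}\\p{N}.-]+\\.[\\p{L}]{2,}";
    private static final String URL = "https?://\\S+";

    private static final Pattern WORDS_ONLY = Pattern.compile(WORD);
    // URL and email alternatives come first so they win over plain words.
    private static final Pattern WITH_URL_EMAIL =
            Pattern.compile(URL + "|" + EMAIL + "|" + WORD);

    static List<String> tokenize(String text, boolean keepUrlsAndEmails) {
        Pattern pattern = keepUrlsAndEmails ? WITH_URL_EMAIL : WORDS_ONLY;
        List<String> tokens = new ArrayList<>();
        Matcher m = pattern.matcher(text);
        while (m.find()) {
            tokens.add(m.group());
        }
        return tokens;
    }

    public static void main(String[] args) {
        String text = "Mail bob@example.com or see http://lucene.apache.org/core";
        // URLs and email addresses kept as single tokens:
        // [Mail, bob@example.com, or, see, http://lucene.apache.org/core]
        System.out.println(tokenize(text, true));
        // Word-level tokens only, exposing the components:
        // [Mail, bob, example.com, or, see, http, lucene.apache.org, core]
        System.out.println(tokenize(text, false));
    }
}
```

With the flag off, the username and hostname surface as separate tokens, which is the kind of output the first mode cannot produce on its own.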

      UAX29Tokenizer should become StandardTokenizer, and the current StandardTokenizer should be renamed to something like UAX29TokenizerPlusPlus.

      For rationale, see the discussion at the reopened LUCENE-2167.

        Attachments

        1. LUCENE-2763.patch
          367 kB
          Steven Rowe
        2. LUCENE-2763.patch
          369 kB
          Steven Rowe
        3. LUCENE-2763.patch
          373 kB
          Steven Rowe

              People

              • Assignee: Steven Rowe (sarowe)
              • Reporter: Steven Rowe (sarowe)