  1. Lucene - Core
  2. LUCENE-2763

Swap URL+Email recognizing StandardTokenizer and UAX29Tokenizer

    Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 3.1, 4.0-ALPHA
    • Fix Version/s: 3.1, 4.0-ALPHA
    • Component/s: modules/analysis
    • Labels:
      None

      Description

      Currently, in addition to implementing the UAX#29 word boundary rules, StandardTokenizer recognizes email addresses and URLs, but it provides no way to turn this behavior off or to emit overlapping tokens for the components (username from an email address, hostname from a URL, etc.).
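
      To make "overlapping tokens with the components" concrete, here is a minimal standalone sketch of the idea. It does not use Lucene's analysis API: the class name ComponentTokens and the method expand are hypothetical, and the email and URL detection is deliberately naive (a single '@', or a "://" that java.net.URI can parse). It only illustrates what emitting the whole token plus its parts might look like.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; not part of Lucene.
class ComponentTokens {

    /**
     * Returns the original token followed by its component tokens, if any.
     * Emails yield username and hostname; URLs yield the hostname.
     */
    static List<String> expand(String token) {
        List<String> out = new ArrayList<>();
        out.add(token); // always keep the full token

        int at = token.indexOf('@');
        if (token.contains("://")) {
            // Assumption: anything containing "://" is treated as a URL candidate.
            try {
                String host = new URI(token).getHost();
                if (host != null) {
                    out.add(host);
                }
            } catch (URISyntaxException e) {
                // Not a parseable URL: emit only the original token.
            }
        } else if (at > 0 && at == token.lastIndexOf('@') && at < token.length() - 1) {
            // Assumption: exactly one '@' with text on both sides means an email address.
            out.add(token.substring(0, at));   // username
            out.add(token.substring(at + 1));  // hostname
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(expand("steve@example.org"));
        System.out.println(expand("http://lucene.apache.org/core/"));
    }
}
```

      In a real analysis chain, the component tokens would presumably be emitted at the same position as the full token so that phrase queries still line up; this sketch ignores positions and offsets entirely.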

      UAX29Tokenizer should become StandardTokenizer, and the current StandardTokenizer should be renamed to something like UAX29TokenizerPlusPlus.

      For rationale, see the discussion at the reopened LUCENE-2167.

        Attachments

        1. LUCENE-2763.patch
          367 kB
          Steve Rowe
        2. LUCENE-2763.patch
          369 kB
          Steve Rowe
        3. LUCENE-2763.patch
          373 kB
          Steve Rowe

              People

              • Assignee:
                steve_rowe Steve Rowe
                Reporter:
                steve_rowe Steve Rowe