Lucene - Core / LUCENE-2763

Swap URL+Email recognizing StandardTokenizer and UAX29Tokenizer

    Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 3.1, 4.0-ALPHA
    • Fix Version/s: 3.1, 4.0-ALPHA
    • Component/s: modules/analysis
    • Labels: None

      Description

      Currently, in addition to implementing the UAX#29 word boundary rules, StandardTokenizer recognizes email addresses and URLs, but provides no way to turn this behavior off or to emit overlapping tokens for the components (such as the username from an email address or the hostname from a URL).
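      To illustrate what "overlapping tokens with the components" could mean, here is a minimal sketch in plain Java. It is not Lucene code and not taken from the attached patches: the class ComponentTokens and its expand method are hypothetical names, and the email check is a deliberately naive assumption (exactly one '@', no "://").

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper for illustration only; not part of Lucene. */
final class ComponentTokens {
    private ComponentTokens() {}

    /**
     * Returns the original token followed by its component tokens, if any:
     * username and host for an email-like token, hostname for a URL-like token.
     */
    static List<String> expand(String token) {
        List<String> out = new ArrayList<>();
        out.add(token); // always keep the full token

        boolean looksLikeUrl = token.contains("://");
        int at = token.indexOf('@');

        // Naive email check (assumption): exactly one '@' with text on both sides.
        if (!looksLikeUrl && at > 0 && at == token.lastIndexOf('@')
                && at < token.length() - 1) {
            out.add(token.substring(0, at));   // username
            out.add(token.substring(at + 1));  // host
            return out;
        }

        if (looksLikeUrl) {
            try {
                String host = new URI(token).getHost();
                if (host != null) {
                    out.add(host); // hostname component
                }
            } catch (URISyntaxException e) {
                // Not a parseable URI: emit only the original token.
            }
        }
        return out;
    }
}
```

      In a real analysis chain the component tokens would presumably be emitted at the same position as the full token so that either form matches at query time; that positioning is outside the scope of this sketch.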

      UAX29Tokenizer should become StandardTokenizer, and the current StandardTokenizer should be renamed to something like UAX29TokenizerPlusPlus.

      For rationale, see the discussion at the reopened LUCENE-2167.

        Attachments

        1. LUCENE-2763.patch
          373 kB
          Steven Rowe
        2. LUCENE-2763.patch
          369 kB
          Steven Rowe
        3. LUCENE-2763.patch
          367 kB
          Steven Rowe

            People

            • Assignee: Steven Rowe (sarowe)
            • Reporter: Steven Rowe (sarowe)
