Lucene - Core / LUCENE-2763

Swap URL+Email recognizing StandardTokenizer and UAX29Tokenizer

    Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 3.1, 4.0-ALPHA
    • Fix Version/s: 3.1, 4.0-ALPHA
    • Component/s: modules/analysis
    • Labels:
      None

      Description

      Currently, in addition to implementing the UAX#29 word boundary rules, StandardTokenizer recognizes email addresses and URLs, but it provides no way to turn this behavior off or to emit overlapping tokens for the components (username from an email address, hostname from a URL, etc.).

      UAX29Tokenizer should become StandardTokenizer, and the current StandardTokenizer should be renamed to something like UAX29TokenizerPlusPlus.

      For rationale, see the discussion at the reopened LUCENE-2167.
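
      To make the behavioral difference concrete, here is a minimal sketch in plain Java. It does not use Lucene at all: the class name TokenizerSwapSketch, the tokenize method, and all three regexes are hypothetical stand-ins, and the word pattern is only a crude approximation of the UAX#29 rules. It shows one mode that emits plain word tokens and one that additionally keeps URLs and email addresses whole.

      ```java
      import java.util.ArrayList;
      import java.util.List;
      import java.util.regex.Matcher;
      import java.util.regex.Pattern;

      // Toy illustration only: these regexes are NOT Lucene's grammar.
      final class TokenizerSwapSketch {

          // Crude stand-in for UAX#29 word rules: letter/digit runs,
          // optionally joined by a single '.' or apostrophe.
          private static final String WORD = "[\\p{L}\\p{N}]+(?:[.'][\\p{L}\\p{N}]+)*";

          // Hypothetical, heavily simplified URL and email patterns.
          private static final String URL = "https?://\\S+";
          private static final String EMAIL =
              "[\\p{L}\\p{N}._%+-]+@[\\p{L}\\p{N}.-]+\\.\\p{L}{2,}";

          private static final Pattern WORDS_ONLY = Pattern.compile(WORD);
          private static final Pattern WITH_URL_EMAIL =
              Pattern.compile(URL + "|" + EMAIL + "|" + WORD);

          // keepUrlsAndEmails = false: word tokens only.
          // keepUrlsAndEmails = true: URLs and emails are tried first and kept whole.
          static List<String> tokenize(String text, boolean keepUrlsAndEmails) {
              Pattern p = keepUrlsAndEmails ? WITH_URL_EMAIL : WORDS_ONLY;
              List<String> tokens = new ArrayList<>();
              Matcher m = p.matcher(text);
              while (m.find()) {
                  tokens.add(m.group());
              }
              return tokens;
          }

          public static void main(String[] args) {
              String text = "Mail bob@example.com or see http://lucene.apache.org/core today";
              System.out.println(tokenize(text, false));
              System.out.println(tokenize(text, true));
          }
      }
      ```

      In the first mode the email address and URL are broken into their components; in the second they survive as single tokens. Having the plain word-boundary behavior as the default and the URL/email recognition as an opt-in variant is the arrangement this issue proposes.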

      1. LUCENE-2763.patch
        373 kB
        Steve Rowe
      2. LUCENE-2763.patch
        369 kB
        Steve Rowe
      3. LUCENE-2763.patch
        367 kB
        Steve Rowe

          Activity

          Steve Rowe added a comment -

          Patch to perform the switch on trunk.

          Robert Muir added a comment -

          +1, looks good to me.

          Steve Rowe added a comment -

          Updated patch to fix solr/CHANGES.txt, lucene/CHANGES.txt, and analysis/standard/READ_BEFORE_REGENERATING.txt.

          I will commit later today if there are no objections.

          Steve Rowe added a comment -

          Final patch, with URL and E-mail tokenization tests added to Solr's TestUAX29URLEmailTokenizerFactory.

          I will commit shortly.

          Steve Rowe added a comment -

          Committed to trunk (revision 1043071) and branch_3x (revision 1043180).

          Grant Ingersoll added a comment -

          Bulk close for 3.1


            People

            • Assignee:
              Steve Rowe
              Reporter:
              Steve Rowe
            • Votes:
              0
              Watchers:
              0
