Lucene - Core / LUCENE-3880

UAX29URLEmailTokenizer fails to recognize emails as such when the mailto: scheme is prepended

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: 3.5, 4.0-ALPHA
    • Fix Version/s: 3.6, 4.0-ALPHA
    • Component/s: None
    • Labels:
      None
    • Lucene Fields:
      New, Patch Available

      Description

      As reported by Kai Gülzau on solr-user:

      UAX29URLEmailTokenizer seems to split at the wrong place:

      mailto:test@example.org

      ->

      mailto:test
      example.org

      As a workaround I use

      <charFilter class="solr.PatternReplaceCharFilterFactory" pattern="mailto:" replacement="mailto: "/>
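
The effect of this workaround on the input text can be sketched in plain Java. This is only an illustration of the string rewrite, using the same pattern and replacement as the configuration above; the class name MailtoWorkaround is hypothetical, and a real char filter also has to keep character offsets consistent, which this sketch ignores.

```java
import java.util.regex.Pattern;

// Hypothetical helper for illustration; not part of Lucene or Solr.
final class MailtoWorkaround {
    // Same pattern as in the charFilter configuration above.
    private static final Pattern MAILTO = Pattern.compile("mailto:");

    // Inserts a space after every "mailto:" so that the following address
    // starts a new run of text and can be recognized on its own.
    static String apply(String input) {
        return MAILTO.matcher(input).replaceAll("mailto: ");
    }

    public static void main(String[] args) {
        System.out.println(apply("mailto:test@example.org"));
    }
}
```

With the space in place, the tokenizer sees "mailto:" and the address as separate pieces of text, which is presumably why the workaround avoids the bad split.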
      

        Activity

        Kai Gülzau added a comment -

        That was fast! Thank you.

        Steve Rowe added a comment -

        Committed to trunk and branch_3x.

        Thanks Kai!

        Steve Rowe added a comment -

        Uwe Schindler wrote: "Can we maybe (as discussed before) also add a corresponding Analyzer (clone of StandardAna)?"

        +1

        Uwe Schindler added a comment -

        Can we maybe (as discussed before) also add a corresponding Analyzer (clone of StandardAna)?

        Steve Rowe added a comment -

        Patch, adding a test for the triggering example, and another test illustrating some of the challenges of handling full mailto: syntax.

        This change triggers a new version for UAX29URLEmailTokenizer, and I've taken advantage of that to update to the most recent top level domain definitions.

        I think this is ready to commit.

        Steve Rowe added a comment -

        RFC 2368 describes URLs employing the mailto: scheme; this RFC has been obsoleted by RFC 6068, which describes the mailto: URI scheme.

        mailto: URIs can contain multiple email addresses, and fielded information including CC, BCC, Subject, and Body - in short, the entire contents of an email message.
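
To give a feel for how much structure a full mailto: URI can carry, here is a rough sketch that pulls the recipients and header fields out of one using only the Java standard library (Java 10 or later). The class MailtoParts and its fields are hypothetical names for this illustration, and the parsing is deliberately simplified: repeated header fields overwrite each other and no address validation is done.

```java
import java.net.URI;
import java.net.URLDecoder;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical illustration; not part of Lucene.
final class MailtoParts {
    final List<String> recipients = new ArrayList<>();
    final Map<String, String> headers = new LinkedHashMap<>();

    static MailtoParts parse(String uriText) {
        URI uri = URI.create(uriText);
        if (!"mailto".equalsIgnoreCase(uri.getScheme())) {
            throw new IllegalArgumentException("not a mailto: URI: " + uriText);
        }
        // mailto: URIs are opaque to java.net.URI, so split the raw part by hand.
        String raw = uri.getRawSchemeSpecificPart();
        int q = raw.indexOf('?');
        String to = q < 0 ? raw : raw.substring(0, q);
        String fields = q < 0 ? "" : raw.substring(q + 1);

        MailtoParts parts = new MailtoParts();
        for (String address : to.split(",")) {
            if (!address.isEmpty()) {
                parts.recipients.add(decode(address));
            }
        }
        for (String field : fields.split("&")) {
            int eq = field.indexOf('=');
            if (eq > 0) {
                String name = decode(field.substring(0, eq)).toLowerCase(Locale.ROOT);
                parts.headers.put(name, decode(field.substring(eq + 1)));
            }
        }
        return parts;
    }

    // URLDecoder would turn '+' into a space, which is form encoding rather
    // than URI encoding, so protect literal '+' characters before decoding.
    private static String decode(String s) {
        return URLDecoder.decode(s.replace("+", "%2B"), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        MailtoParts p = parse(
            "mailto:a@example.org,b@example.org?subject=Hello%20there&cc=c@example.org");
        System.out.println(p.recipients);
        System.out.println(p.headers);
    }
}
```

Even this small sketch has to deal with multiple recipients, percent-encoding, and named fields, which hints at why full mailto: support is a larger job than a tokenizer fix.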

        However, a significant proportion of (probably most) mailto: URIs in the wild contain just a single email address. Short of handling all aspects of the mailto: scheme (out of scope for this issue), I think it would be useful to employ a trick similar to the charFilter hack described by Kai Gülzau: explicitly split "mailto:" off from a following email address, allowing the email address to be recognized as such.
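
The splitting idea can be sketched with a regular expression in plain Java. This is not the committed patch, only an illustration of the behavior described above; the class name MailtoSplitter and the simplified email pattern are assumptions made for this example, and the colon is assumed to act as a delimiter that is dropped from the output.

```java
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration; not the tokenizer's actual grammar.
final class MailtoSplitter {
    // Deliberately simplified email shape, assumed for this sketch only.
    private static final Pattern MAILTO_EMAIL = Pattern.compile(
        "(?i)(mailto):([a-z0-9._%+-]+@[a-z0-9-]+(?:\\.[a-z0-9-]+)+)");

    // If the text is "mailto:" directly followed by something that looks like
    // an email address, return the scheme and the address as two tokens.
    // Otherwise return the text unchanged as a single token.
    static List<String> split(String text) {
        Matcher m = MAILTO_EMAIL.matcher(text);
        if (m.matches()) {
            return List.of(m.group(1), m.group(2));
        }
        return List.of(text);
    }

    public static void main(String[] args) {
        System.out.println(split("mailto:test@example.org"));
    }
}
```

Anything that does not match the single-address shape, such as a mailto: URI with several recipients or header fields, falls through untouched here, which mirrors the decision to leave full mailto: handling out of scope.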


          People

          • Assignee:
            Steve Rowe
            Reporter:
            Steve Rowe