Lucene - Core / LUCENE-3880

UAX29URLEmailTokenizer fails to recognize emails as such when the mailto: scheme is prepended

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: 3.5, 4.0-ALPHA
    • Fix Version/s: 3.6, 4.0-ALPHA
    • Component/s: None
    • Labels:
      None
    • Lucene Fields:
      New, Patch Available

      Description

      As reported by Kai Gülzau on solr-user:

      UAX29URLEmailTokenizer seems to split at the wrong place:

      mailto:test@example.org

      ->

      mailto:test
      example.org

      As a workaround I use

      <charFilter class="solr.PatternReplaceCharFilterFactory" pattern="mailto:" replacement="mailto: "/>
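
The effect of this workaround on the character stream can be sketched in plain Java. This is only an illustration of the regex replacement configured above, assuming the pattern is applied to the whole input; the class name MailtoWorkaround is hypothetical and no Lucene or Solr code is involved. The inserted space gives the tokenizer a break between "mailto:" and the address.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of Lucene or Solr: it mimics the
// pattern/replacement pair from the charFilter configuration above.
final class MailtoWorkaround {
    // Same literal pattern as in the charFilter: pattern="mailto:"
    private static final Pattern MAILTO = Pattern.compile("mailto:");

    // Inserts a space after every "mailto:" so the address that follows
    // starts a new whitespace-delimited chunk.
    static String apply(String input) {
        return MAILTO.matcher(input).replaceAll("mailto: ");
    }

    public static void main(String[] args) {
        // prints "mailto: test@example.org"
        System.out.println(apply("mailto:test@example.org"));
    }
}
```

Input without the scheme passes through unchanged, so the filter should be harmless for plain email addresses.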
      

        Activity

        Steve Rowe created issue -
        Steve Rowe made changes -
        Field: Description
        Change: the example strings were wrapped in {noformat} and the workaround in {code:xml}; the wording is unchanged. The report by Kai Gülzau on solr-user is at http://markmail.org/message/n32kji3okqm2c5qn
        Steve Rowe added a comment -

        RFC 2368 describes URLs employing the mailto: scheme; this RFC has been obsoleted by RFC 6068, which describes the mailto: URI scheme.

        mailto: URIs can contain multiple email addresses, and fielded information including CC, BCC, Subject, and Body - in short, the entire contents of an email message.
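
To make that structure concrete, here is a rough Java sketch that pulls the addresses and header fields out of a mailto: URI. It is a simplification under stated assumptions, not a full RFC 6068 parser and not what the tokenizer does: the class MailtoSketch is hypothetical, addresses are assumed to be comma-separated before the "?", header fields are assumed to be "&"-separated name=value pairs, and a "to" header field is stored like any other header rather than merged into the address list. It uses the URLDecoder.decode(String, Charset) overload, which needs Java 10 or later.

```java
import java.net.URLDecoder;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical illustration of the mailto: URI layout; not Lucene code.
final class MailtoSketch {
    final List<String> to = new ArrayList<>();
    final Map<String, String> headers = new LinkedHashMap<>();

    static MailtoSketch parse(String uri) {
        if (!uri.regionMatches(true, 0, "mailto:", 0, 7)) {
            throw new IllegalArgumentException("not a mailto: URI: " + uri);
        }
        String rest = uri.substring(7);
        int q = rest.indexOf('?');
        String addressPart = q < 0 ? rest : rest.substring(0, q);
        String query = q < 0 ? "" : rest.substring(q + 1);

        MailtoSketch result = new MailtoSketch();
        // Addresses before the '?' are separated by commas.
        for (String address : addressPart.split(",")) {
            if (!address.isEmpty()) {
                result.to.add(decode(address));
            }
        }
        // Header fields (cc, bcc, subject, body, ...) follow the '?'.
        for (String pair : query.split("&")) {
            if (pair.isEmpty()) {
                continue;
            }
            int eq = pair.indexOf('=');
            String name = eq < 0 ? pair : pair.substring(0, eq);
            String value = eq < 0 ? "" : pair.substring(eq + 1);
            // Header names are treated as case-insensitive here.
            result.headers.put(decode(name).toLowerCase(Locale.ROOT), decode(value));
        }
        return result;
    }

    // Percent-decodes a component. URLDecoder would turn '+' into a space,
    // so '+' is escaped first to keep it literal.
    private static String decode(String s) {
        return URLDecoder.decode(s.replace("+", "%2B"), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        MailtoSketch m = parse(
            "mailto:a@example.org,b@example.org?subject=Hello%20there&cc=c@example.org");
        System.out.println(m.to);
        System.out.println(m.headers);
    }
}
```

Even this reduced form shows why full mailto: handling goes beyond a tokenizer rule: the URI carries a list of addresses plus percent-encoded free text.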

        However, a significant proportion of (probably most) mailto: URIs in the wild contain just a single email address. Short of handling all aspects of the mailto: scheme (out of scope for this issue), I think it would be useful to employ a trick similar to the charFilter hack described by Kai Gülzau: explicitly split "mailto:" off from a following email address, allowing the email address to be recognized as such.

        Steve Rowe added a comment -

        Patch, adding a test for the triggering example, and another test illustrating some of the challenges of handling full mailto: syntax.

        This change triggers a new version for UAX29URLEmailTokenizer, and I've taken advantage of that to update to the most recent top level domain definitions.

        I think this is ready to commit.

        Steve Rowe made changes -
        Attachment LUCENE-3880.patch [ 12518838 ]
        Uwe Schindler added a comment -

        Can we maybe (as discussed before) also add a corresponding Analyzer (clone of StandardAna)?

        Steve Rowe added a comment -

        Uwe Schindler wrote: "Can we maybe (as discussed before) also add a corresponding Analyzer (clone of StandardAna)?"

        +1

        Steve Rowe added a comment -

        Committed to trunk and branch_3x.

        Thanks Kai!

        Steve Rowe made changes -
        Status Open [ 1 ] Resolved [ 5 ]
        Lucene Fields New [ 10121 ] New,Patch Available [ 10121,10120 ]
        Fix Version/s 3.6 [ 12319070 ]
        Fix Version/s 4.0 [ 12314025 ]
        Resolution Fixed [ 1 ]
        Kai Gülzau added a comment -

        That was fast! Thank you

        Uwe Schindler made changes -
        Status Resolved [ 5 ] Closed [ 6 ]
        Transition           Time In Source Status   Execution Times   Last Executer   Last Execution Date
        Open -> Resolved     13h 46m                 1                 Steve Rowe      19/Mar/12 04:16
        Resolved -> Closed   417d 6h 27m             1                 Uwe Schindler   10/May/13 11:43

          People

          • Assignee:
            Steve Rowe
            Reporter:
            Steve Rowe
          • Votes:
            0
            Watchers:
            0
