Lucene - Core / LUCENE-1556

some valid email address characters not correctly recognized

Details

    • Type: Bug
    • Status: Closed
    • Priority: Trivial
    • Resolution: Fixed
    • Affects Version/s: 2.4.1
    • Fix Version/s: 3.1, 4.0-ALPHA
    • Component/s: modules/analysis
    • Labels: None
    • Lucene Fields: New

    Description

      The EMAIL expression in StandardTokenizerImpl.jflex does not accept some unusual but valid characters in the left-hand side (the local part) of an email address. As a result, such an address is broken into several tokens, for example:

      somename+site@gmail.com gets broken into "somename" and "site@gmail.com"
      husband&wife@talktalk.net gets broken into "husband" and "wife@talktalk.net"

      These seem to be occurring more often. The first seems to result from an anti-spam trick you can use with Google (see: http://labnol.blogspot.com/2007/08/gmail-plus-smart-trick-to-find-block.html). I see the second in several domains, but a disproportionate number are from talktalk.net, so I expect it's a signup suggestion from the service.

      Perhaps a fix would be to change line 102 of StandardTokenizerImpl.jflex from:
      EMAIL = {ALPHANUM} (("."|"-"|"_") {ALPHANUM})* "@" {ALPHANUM} (("."|"-") {ALPHANUM})+

      to

      EMAIL = {ALPHANUM} (("."|"-"|"_"|"+"|"&") {ALPHANUM})* "@" {ALPHANUM} (("."|"-") {ALPHANUM})+

      I'm aware that the StandardTokenizer is meant to be a basic implementation rather than an implementation of the full standard, but it is quite useful in places and hopefully this would improve it slightly.
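The effect of the proposed change can be approximated outside JFlex with ordinary regular expressions. The following is a minimal sketch, not the tokenizer itself: it assumes {ALPHANUM} can be simplified to runs of ASCII letters and digits (the real macro in StandardTokenizerImpl.jflex is broader), and the names OLD_EMAIL, NEW_EMAIL and first_email are hypothetical, introduced only for illustration.

```python
import re

# Simplified stand-in for the JFlex {ALPHANUM} macro.
# Assumption: ASCII letters and digits only; the real macro covers more.
ALPHANUM = r"[A-Za-z0-9]+"

# Approximation of the current rule: only ".", "-" and "_" may join
# alphanumeric runs on the left-hand side of the "@".
OLD_EMAIL = re.compile(
    rf"{ALPHANUM}(?:[._\-]{ALPHANUM})*@{ALPHANUM}(?:[.\-]{ALPHANUM})+"
)

# Approximation of the proposed rule: "+" and "&" are accepted as well.
NEW_EMAIL = re.compile(
    rf"{ALPHANUM}(?:[._\-+&]{ALPHANUM})*@{ALPHANUM}(?:[.\-]{ALPHANUM})+"
)


def first_email(pattern, text):
    """Return the first substring of text matched by pattern, or None."""
    match = pattern.search(text)
    return match.group(0) if match else None


if __name__ == "__main__":
    for address in ("somename+site@gmail.com", "husband&wife@talktalk.net"):
        print(address)
        print("  current rule :", first_email(OLD_EMAIL, address))
        print("  proposed rule:", first_email(NEW_EMAIL, address))
```

Under these assumptions, the current rule only recognizes the part of each address after the "+" or "&", which mirrors the split described above, while the proposed rule matches the whole address.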

            People

              Assignee: Unassigned
              Reporter: Paul Nilsson (paulnilsson)