Details
Bug
Status: Closed
Trivial
Resolution: Fixed
2.4.1
None
New
Description
The EMAIL expression in StandardTokenizerImpl.jflex misses some unusual but valid characters in the left-hand side (the local part) of the email address. This causes an address to be broken into several tokens, for example:
somename+site@gmail.com gets broken into "somename" and "site@gmail.com"
husband&wife@talktalk.net gets broken into "husband" and "wife@talktalk.net"
These seem to be occurring more often. The first seems to be because of an anti-spam trick you can use with Google (see: http://labnol.blogspot.com/2007/08/gmail-plus-smart-trick-to-find-block.html). I see the second in several domains, but a disproportionate number are from talktalk.net, so I expect it's a signup suggestion from the service.
Perhaps a fix would be to change line 102 of StandardTokenizerImpl.jflex from:
EMAIL =
{ALPHANUM} (("."|"-"|"_") {ALPHANUM})* "@"
{ALPHANUM} (("."|"-") {ALPHANUM})+
to
EMAIL =
{ALPHANUM} (("."|"-"|"_"|"+"|"&") {ALPHANUM})* "@"
{ALPHANUM} (("."|"-") {ALPHANUM})+
I'm aware that the StandardTokenizer is meant to be a basic implementation rather than an implementation of the full standard, but it is quite useful in places and hopefully this would improve it slightly.
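The effect of the proposed change can be sketched outside of JFlex with java.util.regex. This is only an approximation, not the tokenizer itself: the class name EmailPatternSketch and the helper firstEmail are hypothetical, {ALPHANUM} is assumed to mean "one or more letters or digits", and regex searching does not reproduce JFlex's longest-match tokenization. It only shows which part of each sample address the two EMAIL expressions would accept.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class EmailPatternSketch {
    // Assumption: stand-in for the grammar's {ALPHANUM} macro (letters and digits only).
    static final String ALPHANUM = "[\\p{L}\\p{N}]+";

    // Right-hand side shared by both expressions: {ALPHANUM} (("."|"-") {ALPHANUM})+
    static final String DOMAIN = ALPHANUM + "(?:[.-]" + ALPHANUM + ")+";

    // Existing expression: separators ".", "-" and "_" in the local part.
    static final Pattern CURRENT =
        Pattern.compile(ALPHANUM + "(?:[._-]" + ALPHANUM + ")*@" + DOMAIN);

    // Proposed expression: additionally allows "+" and "&" as separators.
    static final Pattern PROPOSED =
        Pattern.compile(ALPHANUM + "(?:[._+&-]" + ALPHANUM + ")*@" + DOMAIN);

    // Returns the first substring accepted by the pattern, or null if there is none.
    static String firstEmail(Pattern pattern, String text) {
        Matcher m = pattern.matcher(text);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        String[] samples = { "somename+site@gmail.com", "husband&wife@talktalk.net" };
        for (String s : samples) {
            System.out.println(s);
            System.out.println("  current:  " + firstEmail(CURRENT, s));
            System.out.println("  proposed: " + firstEmail(PROPOSED, s));
        }
    }
}
```

With the existing expression only the part after the "+" or "&" is recognised as an email address, which matches the splitting described above; with the proposed expression the whole address is accepted.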
Issue Links
- is part of LUCENE-2167 Implement StandardTokenizer with the UAX#29 Standard (Closed)