Lucene - Core / LUCENE-5753

Refresh UAX29URLEmailTokenizer's TLD list


Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 7.1, 8.0
    • Component/s: modules/analysis
    • Labels: None
    • Lucene Fields: New

    Description

      The uax_url_email analyzer appears unable to recognize the ".local" TLD, among others. The bug can be reproduced with

      curl -XGET "ADDRESS/INDEX/_analyze?text=First%20Last%20lname@section.mycorp.local&pretty&analyzer=uax_url_email"

      which parses "lname@section.my" and "corp.local" as separate tokens, as opposed to

      curl -XGET "ADDRESS/INDEX/_analyze?text=First%20Last%20lname@section.mycorp.org&pretty&analyzer=uax_url_email"

      which will recognize "lname@section.mycorp.org".

      Can this be fixed by updating to a newer version? I am running ElasticSearch 0.90.5 and whatever Lucene version sits underneath that. My suspicion is that the TLD list the analyzer relies on (http://www.internic.net/zones/root.zone, I think?) is incomplete and needs updating.
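The suspicion above (a recognizer that depends on a fixed TLD list) can be illustrated with a toy model. This is only a sketch under stated assumptions: the regular expressions and the `make_tokenizer` helper are hypothetical stand-ins, not Lucene's actual grammar, and the TLD sets are made up for the demonstration. It shows one plausible way the reported split could arise: if "local" is missing from the list but "my" is present, the longest address that still ends in a known TLD is "lname@section.my", and the remainder falls through as an ordinary word.

```python
import re


def make_tokenizer(tlds):
    """Build a toy tokenizer whose email rule only accepts the given TLDs.

    Hypothetical illustration only; this is not how Lucene implements
    UAX29URLEmailTokenizer.
    """
    # Longest TLDs first, so the alternation prefers the longest candidate.
    tld_alt = "|".join(sorted(map(re.escape, tlds), key=len, reverse=True))
    # An email is local-part@labels.TLD, where TLD must come from the list.
    email = rf"[A-Za-z0-9._%+-]+@(?:[A-Za-z0-9-]+\.)+(?:{tld_alt})"
    # Fallback rule: alphanumeric runs, optionally joined by single dots.
    word = r"[A-Za-z0-9]+(?:\.[A-Za-z0-9]+)*"
    pattern = re.compile(rf"(?P<email>{email})|(?P<word>{word})")

    def tokenize(text):
        # Return (token text, rule name) pairs in order of appearance.
        return [(m.group(0), m.lastgroup) for m in pattern.finditer(text)]

    return tokenize


# Assumed, deliberately tiny TLD lists for the demonstration.
stale = make_tokenizer(["com", "org", "my"])
refreshed = make_tokenizer(["com", "org", "my", "local"])

text = "First Last lname@section.mycorp.local"
print(stale(text))
print(refreshed(text))
```

With the stale list, the address is cut at "lname@section.my" and "corp.local" is emitted as a separate word; with "local" added, the whole address comes out as a single email token. Whether this is the exact mechanism inside the real tokenizer is an assumption, but it is consistent with the behaviour described in the report.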



          People

            Assignee: Steven Rowe (sarowe)
            Reporter: Steve Merritt (stevemer)
            Votes: 12
            Watchers: 10
