Lucene - Core / LUCENE-5753

Refresh UAX29URLEmailTokenizer's TLD list

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 7.1, master (8.0)
    • Component/s: modules/analysis
    • Labels: None
    • Lucene Fields: New

      Description

      The uax_url_email analyzer appears unable to recognize the ".local" TLD, among others. The bug can be reproduced with

      curl -XGET "ADDRESS/INDEX/_analyze?text=First%20Last%20lname@section.mycorp.local&pretty&analyzer=uax_url_email"

      which splits the address into two separate tokens, "lname@section.my" and "corp.local", as opposed to

      curl -XGET "ADDRESS/INDEX/_analyze?text=First%20Last%20lname@section.mycorp.org&pretty&analyzer=uax_url_email"

      which will recognize "lname@section.mycorp.org".

      Can this be fixed by updating to a newer version? I am running Elasticsearch 0.90.5 with whatever Lucene version it bundles. My suspicion is that the TLD list the analyzer relies on (http://www.internic.net/zones/root.zone, I think?) is incomplete and needs updating.
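
      To illustrate the suspicion above, here is a minimal Python sketch of an email matcher that only accepts hosts ending in a TLD from a fixed list. It is not Lucene's actual grammar: the function names, the regex, and the tiny TLD sets are all assumptions made up for this example. It only shows how a list that lacks "local" could produce the truncated "lname@section.my" token reported here.

```python
import re


def build_email_pattern(tlds):
    """Compile a regex that accepts only hosts ending in one of `tlds`.

    Hypothetical helper for illustration; not how Lucene builds its grammar.
    """
    # Put longer TLDs first so they are preferred over shorter ones.
    tld_alt = "|".join(
        sorted((re.escape(t) for t in tlds), key=len, reverse=True)
    )
    # local-part, "@", one or more dot-terminated labels, then a listed TLD.
    # There is deliberately no check on what follows the TLD, so a listed
    # TLD may match the first letters of a longer label.
    return re.compile(
        r"[A-Za-z0-9._%+-]+@(?:[A-Za-z0-9-]+\.)+(?:" + tld_alt + r")",
        re.IGNORECASE,
    )


def first_email(text, tlds):
    """Return the first email-like match in `text`, or None."""
    match = build_email_pattern(tlds).search(text)
    return match.group(0) if match else None


if __name__ == "__main__":
    text = "First Last lname@section.mycorp.local"
    # Assumed stale list without "local": the match stops at ".my".
    print(first_email(text, {"org", "my"}))
    # Assumed refreshed list with "local": the whole address matches.
    print(first_email(text, {"org", "my", "local"}))
```

      With the assumed stale list the matcher falls back to the shorter "my" suffix, which mirrors the split described above; adding "local" to the list lets the full address through.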

            People

            • Assignee: Steve Rowe (steve_rowe)
            • Reporter: Steve Merritt (stevemer)
            • Votes: 12
            • Watchers: 10
