Details
Type: Bug
Status: Open
Priority: Minor
Resolution: Unresolved
Affects Version/s: 6.6
Fix Version/s: None
Component/s: None
Environment:
Elasticsearch 5.5.2, Build: b2f0c09/2017-08-14T12:33:14.154Z, JVM: 1.8.0_144
JVM java version "1.8.0_144"
Java(TM) SE Runtime Environment (build 1.8.0_144-b01)
Java HotSpot(TM) 64-Bit Server VM (build 25.144-b01, mixed mode)
OS Linux 3.10.0-514.10.2.el7.x86_64 #1 SMP Mon Feb 20 02:37:52 EST 2017 x86_64 x86_64 x86_64 GNU/Linux
Lucene Fields: New
Description
I noticed that the uax_url_email tokenizer splits URLs into multiple tokens in the presence of digits, ".", and "-".
I opened an issue on the Elasticsearch GitHub repo (https://github.com/elastic/elasticsearch/issues/27309) because I noticed this strange behaviour.
Their answer was:
The uax_url_email tokenizer tokenizes URLs and email addresses, but in order to recognize a token as a URL it must include the scheme (usually HTTP:// or https://):
Additionally, this tokenizer belongs to Lucene. Could you open this issue at https://lucene.apache.org/core/ instead?
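For reference, the behaviour can be reproduced with Elasticsearch's _analyze API. The request below is a sketch (the sample text is illustrative, not taken from the original report):

```
POST _analyze
{
  "tokenizer": "uax_url_email",
  "text": "www.example-site1.com/path http://www.example-site1.com/path"
}
```

Per the report, the scheme-less form is split into several tokens at the "-", "." and digit boundaries, whereas the http:// form is recognized as a single URL token.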
URLs are defined by RFC 1738 and extended by RFC 1808.
RFC 1808 specifies relative URLs, which means scheme-less URLs are allowed.
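As a quick illustration that scheme-less references are still valid URLs, Python's standard urllib.parse resolves them per the RFC 3986 rules that superseded RFC 1808 (hostnames below are illustrative only):

```python
from urllib.parse import urljoin, urlparse

base = "http://example.com/a/b"

# Network-path reference ("//host/path"): scheme-less, resolved against the base's scheme
print(urljoin(base, "//other.example.com/x"))   # http://other.example.com/x

# Relative path reference, resolved against the base's path
print(urljoin(base, "c/d"))                     # http://example.com/a/c/d

# A bare "www." form parses with an empty scheme, yet is commonly written as a URL
print(urlparse("www.example.com/path").scheme)  # '' (empty string)
```

Both relative forms denote real URLs once resolved, which is why one would expect a URL-aware tokenizer to handle them.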
I would expect uax_url_email to also tokenize scheme-less and relative URLs correctly.