We are using the UAX29URLEmailTokenizer so we can use the token types in our plugins.
However, I noticed that the tokenizer classifies certain URLs as <ALPHANUM> instead of <URL>.
Examples that are not working:
- example.com is <ALPHANUM>
- example.net is <ALPHANUM>
Examples that work:
- https://example.com is <URL>
- https://example.net is <URL>
- example.ch is <URL>
- example.co.uk is <URL>
- example.nl is <URL>
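The examples above can be checked with a small standalone program. The following is only a minimal sketch, assuming the Lucene analyzers-common module is on the classpath and that the no-argument UAX29URLEmailTokenizer constructor is available (as it is in recent Lucene releases). The class name UrlTokenTypeCheck and the helper tokenTypes are hypothetical names chosen for illustration. It prints each token together with the type the tokenizer assigns to it.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.lucene.analysis.standard.UAX29URLEmailTokenizer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.TypeAttribute;

// Hypothetical helper class for reproducing the report; not part of Lucene.
public class UrlTokenTypeCheck {

    // Tokenizes the input and returns one "term type" string per token.
    static List<String> tokenTypes(String input) {
        List<String> result = new ArrayList<>();
        try (UAX29URLEmailTokenizer tokenizer = new UAX29URLEmailTokenizer()) {
            CharTermAttribute term = tokenizer.addAttribute(CharTermAttribute.class);
            TypeAttribute type = tokenizer.addAttribute(TypeAttribute.class);
            tokenizer.setReader(new StringReader(input));
            tokenizer.reset(); // required before the first incrementToken()
            while (tokenizer.incrementToken()) {
                result.add(term.toString() + " " + type.type());
            }
            tokenizer.end();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return result;
    }

    public static void main(String[] args) {
        // Inputs taken from the examples listed in this report.
        String[] inputs = {
            "example.com", "example.net",
            "https://example.com", "https://example.net",
            "example.ch", "example.co.uk", "example.nl"
        };
        for (String input : inputs) {
            System.out.println(input + " -> " + tokenTypes(input));
        }
    }
}
```

Running this against different Lucene versions should show whether the type assigned to the scheme-less inputs changes between releases.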
I have searched this JIRA and could not find an existing issue. I have tested this on Lucene (Solr) 6.4.1 and 7.3.
Could someone confirm my findings and advise what I could do to (help) resolve this issue?
- is broken by LUCENE-5391: UAX29URLEmailTokenizer should not tokenize no-scheme domain-only URLs that are followed by an alphanumeric character