Uploaded image for project: 'Lucene - Core'
  1. Lucene - Core
  2. LUCENE-8044

UAX_URL_EMAIL tokenizer not compliant to rfc1808

Attach filesAttach ScreenshotAdd voteVotersWatch issueWatchersCreate sub-taskLinkCloneUpdate Comment AuthorReplace String in CommentUpdate Comment VisibilityDelete Comments
    XMLWordPrintableJSON

    Details

    • Type: Bug
    • Status: Open
    • Priority: Minor
    • Resolution: Unresolved
    • Affects Version/s: 6.6
    • Fix Version/s: None
    • Component/s: core/other
    • Labels:
      None
    • Environment:
    • Lucene Fields:
      New

      Description

      I noticed that the uax_url_email tokenizer splits urls in multiple tokens in presence of digits, ".", "-"

      I opened a issue on elasticsearch github repo (https://github.com/elastic/elasticsearch/issues/27309) because I noticed this strange behaviour.

      Their answer was

      The uax_url_email tokenizer tokenizes URLs and email addresses, but in order to recognize a token as a URL it must include the scheme (usually HTTP:// or https://):
      Additionally, this tokenizer belongs to Lucene. Could you open this issue at https://lucene.apache.org/core/ instead?

      URLs are defined by RFC1738 and extended by RFC1808.
      In RFC1808 Relative URLs are explained, and this allows scheme-less URLs.
      I would expect uax_url_email to tokenize correctly also scheme-less and relative URL.

        Attachments

          Activity

            People

            • Assignee:
              Unassigned
              Reporter:
              mellowonpsx Sergio Leoni

              Dates

              • Created:
                Updated:

                Issue deployment