Lucene - Core / LUCENE-5391

UAX29URLEmailTokenizer should not tokenize no-scheme domain-only URLs that are followed by an alphanumeric character

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 4.7, 6.0
    • Component/s: None
    • Labels: None
    • Lucene Fields: New

      Description

      The UAX29URLEmailTokenizer tokenises "index2.php" as:

      <URL> index2.ph
      <ALPHANUM> p

      It does not do the same for "index.php", which is kept as a single token.

      Screenshot from analyser: http://postimg.org/image/aj6c98n3b/

      1. LUCENE-5391.patch
        828 kB
        Steve Rowe
      2. LUCENE-5391.patch
        780 kB
        Steve Rowe

        Activity

        Steve Rowe added a comment (edited)

        I understand why "index.php" is not broken up: the <URL> rule matches "index.ph", but the <ALPHANUM> rule has a longer match, so it wins.

        Conversely, <ALPHANUM> does not match "index2.php" (likely because the [number][period] sequence is not allowed), so the shorter <URL> match is tokenized.

        Another improperly broken-up filename-looking thing: "index-h.php" - the <URL> rule matches "index-h.ph", but the <ALPHANUM> rule doesn't match (likely because of the hyphen).
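
The longest-match behaviour described above can be illustrated with a small, self-contained sketch. The two regular expressions are toy stand-ins, not Lucene's actual JFlex grammar: the `URL` pattern assumes a tiny hypothetical TLD list (`com`, `org`, `ph`), and the `ALPHANUM` pattern assumes that a period only joins letter-to-letter or digit-to-digit, in the spirit of the UAX#29 word-break rules. The class and method names are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LongestMatchSketch {
    // One domain label: alphanumerics, hyphens allowed only in the interior.
    static final String LABEL = "[A-Za-z0-9](?:[-A-Za-z0-9]*[A-Za-z0-9])?";

    // Toy <URL>: dot-separated labels ending in a TLD from a tiny, hypothetical list.
    static final Pattern URL = Pattern.compile("(?:" + LABEL + "\\.)+(?:com|org|ph)");

    // Toy <ALPHANUM>: a period joins two runs only between letters or between digits.
    static final Pattern ALPHANUM = Pattern.compile(
        "[A-Za-z0-9]+(?:(?:(?<=[A-Za-z])\\.(?=[A-Za-z])|(?<=[0-9])\\.(?=[0-9]))[A-Za-z0-9]+)*");

    // Length of the match of p starting exactly at pos, or 0 if there is none.
    static int matchLength(Pattern p, String text, int pos) {
        Matcher m = p.matcher(text);
        m.region(pos, text.length());
        return m.lookingAt() ? m.end() - pos : 0;
    }

    // At each position, try both rules and emit the token of the longer match.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        int pos = 0;
        while (pos < text.length()) {
            int urlLen = matchLength(URL, text, pos);
            int alnumLen = matchLength(ALPHANUM, text, pos);
            if (urlLen == 0 && alnumLen == 0) {
                pos++; // neither rule matches here: skip the character
            } else if (urlLen > alnumLen) {
                tokens.add("<URL> " + text.substring(pos, pos + urlLen));
                pos += urlLen;
            } else {
                tokens.add("<ALPHANUM> " + text.substring(pos, pos + alnumLen));
                pos += alnumLen;
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("index.php"));   // [<ALPHANUM> index.php]
        System.out.println(tokenize("index2.php"));  // [<URL> index2.ph, <ALPHANUM> p]
        System.out.println(tokenize("index-h.php")); // [<URL> index-h.ph, <ALPHANUM> p]
    }
}
```

In this toy model "index.php" survives only because the nine-character `ALPHANUM` match beats the eight-character `URL` match; once `ALPHANUM` stops early (at the digit-period boundary or at the hyphen), the shorter `URL` match is the longest one left.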

        I think the fix here is to disallow <URL>s when there is no trailing port, path, query or fragment, and the following character is [-A-Za-z0-9] (allowable domain label characters).

        I'll make a patch.
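
As a rough illustration of the proposed rule (not the committed patch), the guard can be expressed as a negative lookahead on a toy URL pattern: a domain-only match is accepted only when the next character is not in [-A-Za-z0-9], while a match that continues with a port, path, query or fragment is left alone. The patterns, the three-entry TLD list, and all names below are assumptions made for this sketch.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GuardedUrlSketch {
    // One domain label: alphanumerics, hyphens allowed only in the interior.
    static final String LABEL = "[A-Za-z0-9](?:[-A-Za-z0-9]*[A-Za-z0-9])?";

    // Toy no-scheme domain with a TLD from a tiny, hypothetical list.
    static final String DOMAIN = "(?:" + LABEL + "\\.)+(?:com|org|ph)";

    // Toy trailing part: a port (optionally followed by more), or a path/query/fragment.
    static final String TRAILING = "(?::[0-9]+(?:[/?#]\\S*)?|[/?#]\\S*)";

    // Behaviour before the fix: the trailing part is simply optional.
    static final Pattern UNGUARDED = Pattern.compile(DOMAIN + TRAILING + "?");

    // Proposed rule: with no trailing part, the next character must not be
    // an allowable domain label character.
    static final Pattern GUARDED =
        Pattern.compile(DOMAIN + "(?:" + TRAILING + "|(?![-A-Za-z0-9]))");

    // Returns the prefix of text matched by p, or null if p does not match at offset 0.
    static String matchPrefix(Pattern p, String text) {
        Matcher m = p.matcher(text);
        return m.lookingAt() ? m.group() : null;
    }

    public static void main(String[] args) {
        String[] samples = {
            "index2.php", "index-h.php", "example.ph",
            "example.ph:8080", "example.ph/index2.php"
        };
        for (String s : samples) {
            System.out.println(s + " -> unguarded=" + matchPrefix(UNGUARDED, s)
                + ", guarded=" + matchPrefix(GUARDED, s));
        }
    }
}
```

With the guard in place, "index2.php" and "index-h.php" no longer yield a URL prefix in this sketch, so a tokenizer would fall back to its other rules, while "example.ph" on its own or with a port or path is still recognised.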

        Steve Rowe added a comment -

        Patch fixing the bug, with some added tests.

        Committing shortly.

        Steve Rowe added a comment -

        On second thought, I think no-scheme domain-only URL recognition should be blocked by any following alphanumeric character, not just a domain label character. New patch with implementation.
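
The difference between the two guards can be sketched by comparing lookaheads on a toy pattern. This assumes "alphanumeric" means any Unicode letter or digit (`\p{L}`, `\p{N}`); the comment does not say how the actual patch treats a following hyphen, so the broader guard here checks letters and digits only. Patterns and names are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GuardComparisonSketch {
    // Toy no-scheme domain with a TLD from a tiny, hypothetical list.
    static final String DOMAIN =
        "(?:[A-Za-z0-9](?:[-A-Za-z0-9]*[A-Za-z0-9])?\\.)+(?:com|org|ph)";

    // First proposal: block only on a following domain label character.
    static final Pattern LABEL_GUARD = Pattern.compile(DOMAIN + "(?![-A-Za-z0-9])");

    // Revised proposal (as interpreted here): block on any following letter or digit.
    static final Pattern ALNUM_GUARD = Pattern.compile(DOMAIN + "(?![\\p{L}\\p{N}])");

    // Returns the prefix of text matched by p, or null if p does not match at offset 0.
    static String matchPrefix(Pattern p, String text) {
        Matcher m = p.matcher(text);
        return m.lookingAt() ? m.group() : null;
    }

    public static void main(String[] args) {
        String accented = "example.ph\u00e9"; // domain followed by a non-ASCII letter
        System.out.println(matchPrefix(LABEL_GUARD, accented)); // example.ph
        System.out.println(matchPrefix(ALNUM_GUARD, accented)); // null
    }
}
```

Under these assumptions the narrower guard still splits off "example.ph" when a non-ASCII letter follows, which is the kind of case the broader check is meant to cover.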

        ASF subversion and git services added a comment -

        Commit 1557042 from Steve Rowe in branch 'dev/trunk'
        [ https://svn.apache.org/r1557042 ]

        LUCENE-5391: UAX29URLEmailTokenizer should not tokenize no-scheme domain-only URLs that are followed by an alphanumeric character

        ASF subversion and git services added a comment -

        Commit 1557046 from Steve Rowe in branch 'dev/branches/branch_4x'
        [ https://svn.apache.org/r1557046 ]

        LUCENE-5391: UAX29URLEmailTokenizer should not tokenize no-scheme domain-only URLs that are followed by an alphanumeric character (merged trunk r1557042)

        Steve Rowe added a comment -

        Committed to trunk and branch_4x.

        Thanks Chris!


          People

          • Assignee: Steve Rowe
          • Reporter: Chris Geeringh
          • Votes: 0
          • Watchers: 2
