Lucene - Core / LUCENE-9370

RegExpQuery should error for inappropriate use of \ character in input

Details

    • Type: Bug
    • Status: Closed
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: 9.0
    • Fix Version/s: 9.0
    • Component/s: core/search
    • Labels: None
    • Lucene Fields: New

    Description

      The RegExp class is too lenient when parsing user input, which can confuse or mislead users and causes backwards compatibility issues as we enhance regex support.

      In normal regular expression syntax the backslash is used to:

      • escape a reserved character like \.
      • use certain unreserved characters in a shorthand context e.g. \d means digits [0-9]
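
Both roles can be seen with `java.util.regex`, used here only as a familiar reference parser (Lucene's RegExp has its own syntax). The class name `BackslashRoles` and its helper are made up for this sketch:

```java
import java.util.regex.Pattern;

final class BackslashRoles {
    /** Returns true if the whole input matches the regex (java.util.regex semantics). */
    static boolean matches(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    public static void main(String[] args) {
        // Role 1: the backslash escapes the reserved '.', so only a literal dot matches.
        System.out.println(matches("a\\.b", "a.b")); // true
        System.out.println(matches("a\\.b", "axb")); // false
        // Without the backslash, '.' matches any character.
        System.out.println(matches("a.b", "axb"));   // true

        // Role 2: the backslash turns the unreserved letter 'd' into the digit shorthand.
        System.out.println(matches("\\d+", "2020")); // true
        System.out.println(matches("\\d+", "d"));    // false
    }
}
```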

      The leniency bug in RegExp is that it adds an extra rule to this list: any backslashed character that doesn't satisfy the rules above is taken literally. For example, there's no reason to put a backslash in front of the letter "p", but we accept \p as the letter p.

      Java's Pattern class will throw a parse exception given a meaningless backslash like \p.
      We should too.
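
A small check of the Java behaviour, as a sketch. It uses `\y` as the meaningless escape because in `java.util.regex` a `\p` begins a property class such as `\p{Alpha}`; the class name `MeaninglessBackslash` and the `compiles` helper are made up for illustration:

```java
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

final class MeaninglessBackslash {
    /** Returns true if java.util.regex accepts the expression, false if it rejects it. */
    static boolean compiles(String regex) {
        try {
            Pattern.compile(regex);
            return true;
        } catch (PatternSyntaxException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(compiles("\\."));  // true: escaped reserved character
        System.out.println(compiles("\\d"));  // true: digit shorthand
        System.out.println(compiles("\\y"));  // false: backslash before a letter with no meaning
    }
}
```

The Pattern documentation reserves backslash-plus-letter combinations that are not defined constructs, which is why the last case is rejected rather than read as a plain "y".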

      In LUCENE-9336 we added support for commonly supported regex shorthands like `\d`. Sadly, this is a breaking change because the leniency previously allowed \d to be accepted as the letter d without an exception. Users were likely silently missing results they were hoping for, and we created a BWC problem for ourselves by filling in the gaps.

      I propose we do as other regex parsers do and raise an error on inappropriate use of backslashes.
      This will be another breaking change, so it should target 9.0.
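
One way to picture the stricter rule is the standalone sketch below. It is not Lucene's RegExp code: the class `StrictEscapeCheck`, the shorthand set `dDsSwW`, and the decision to reject only ASCII letters are all assumptions made for illustration.

```java
final class StrictEscapeCheck {
    // Assumed set of letters that have a shorthand meaning after a backslash.
    private static final String SHORTHAND = "dDsSwW";

    /** Throws IllegalArgumentException if a backslash precedes a letter with no special meaning. */
    static void validate(String regex) {
        for (int i = 0; i < regex.length(); i++) {
            if (regex.charAt(i) != '\\') {
                continue;
            }
            if (i + 1 >= regex.length()) {
                throw new IllegalArgumentException("trailing backslash at position " + i);
            }
            char next = regex.charAt(i + 1);
            boolean asciiLetter = (next >= 'a' && next <= 'z') || (next >= 'A' && next <= 'Z');
            if (asciiLetter && SHORTHAND.indexOf(next) < 0) {
                throw new IllegalArgumentException(
                    "illegal escape \\" + next + " at position " + i);
            }
            i++; // skip the escaped character so "\\\\" is read as one escaped backslash
        }
    }

    /** Convenience wrapper: true if validate() accepts the expression. */
    static boolean isValid(String regex) {
        try {
            validate(regex);
            return true;
        } catch (IllegalArgumentException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(isValid("\\d+\\.txt")); // true: shorthand and escaped reserved char
        System.out.println(isValid("a\\\\p"));     // true: escaped backslash, then a literal p
        System.out.println(isValid("\\p"));        // false: meaningless backslash
    }
}
```

Rejecting the input up front keeps unassigned letters free, so a later release can give `\p` or similar a meaning without silently changing the results of existing queries.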

          People

            Assignee: Unassigned
            Reporter: Mark Harwood (mharwood)