Xerces2-J / XERCESJ-1389

RegEx matching: ranges not computed correctly in "ignore case" mode

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 2.9.1
    • Fix Version/s: 2.10.0
    • Component/s: Other
    • Labels:
      None

      Description

      There are a couple of problems in interpreting character ranges in "case-insensitive" mode.

      When doing range subtraction (or negation), all the case-variants of the subtracted characters need to be considered. For example, "[^Q]" means, in case-insensitive mode, "any character except 'q' or 'Q'" but the regex engine matches both 'q' and 'Q' in this example.

      Also, in case-insensitive mode, character classes must keep their meaning. For example, "\p{Lu}" should not match a lowercase letter, but the regex engine matches 'q'.
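      The expected behavior described above can be sketched in plain Java. This is only an illustration of the intended semantics, not the Xerces implementation: the class and method names are hypothetical, and it assumes the simple one-to-one case mappings provided by java.lang.Character.

```java
import java.util.Collections;
import java.util.Set;

// Hypothetical sketch of the semantics described in this report;
// it does not use or reproduce any Xerces code.
class NegatedClassSketch {

    // Case-insensitive match against a negated class such as [^Q]:
    // the candidate is rejected if it, or its lower-case or upper-case
    // form, appears in the excluded set.
    static boolean matchesNegated(Set<Integer> excluded, int cp) {
        return !(excluded.contains(cp)
                || excluded.contains(Character.toLowerCase(cp))
                || excluded.contains(Character.toUpperCase(cp)));
    }

    // \p{Lu} is decided by general category alone, so the
    // case-insensitive flag is deliberately not consulted here.
    static boolean isLu(int cp) {
        return Character.getType(cp) == Character.UPPERCASE_LETTER;
    }

    public static void main(String[] args) {
        Set<Integer> excluded = Collections.singleton((int) 'Q');
        System.out.println(matchesNegated(excluded, 'q')); // false
        System.out.println(matchesNegated(excluded, 'Q')); // false
        System.out.println(matchesNegated(excluded, 'x')); // true
        System.out.println(isLu('q'));                     // false
        System.out.println(isLu('Q'));                     // true
    }
}
```

      Checking the case variants of the candidate character is equivalent to case-expanding the excluded set only when the mappings are invertible, which is an assumption of this sketch.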

        Activity

        Josh Spiegel added a comment -

        Does anybody know if/when this will be fixed?

        Michael Glavassevich added a comment -

        I believe Khaled fixed the issues with case insensitive matching today in SVN rev 831926. Please verify.

        Josh Spiegel added a comment -

        First of all, thanks for fixing this bug.

        I was looking at this fix (831926) and I think there may be a problem but I am not positive. I apologize in advance if I am mistaken.

        When interpreting a case-insensitive range, the code seems to add the lower-case and upper-case forms of each character in the range (see the new RegexParser.addCaseInsensitiveChar and RegexParser.addCaseInsensitiveCharRange). However, it is my understanding that not all character case mappings in Unicode are invertible like this (http://unicode.org/faq/casemap_charprop.html#2).

        For example both capital K and the kelvin sign have a lower-case of 'k':
        lower-case(['K' - 0x004B]) == 'k'
        AND
        lower-case([Kelvin-sign - 0x212A]) == 'k'

        So, if I have a regular expression 'k', in case insensitive mode shouldn't this match both 'K' and the Kelvin-sign? Currently it seems it would only match 'k' or 'K'.

        Thanks.
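        The Kelvin-sign case above can be checked with java.lang.Character alone. The following is a minimal sketch with hypothetical names, not the fix that was checked in: it finds the case variants of a character by a brute-force scan over all code points, using only simple case mappings. A real engine would more likely use a precomputed table.

```java
import java.util.Set;
import java.util.TreeSet;

// Hypothetical sketch; not taken from RegexParser or any Xerces class.
class CaseClosureSketch {

    // Collects every code point that shares a lower-case or upper-case
    // form with the target. Unlike applying toLowerCase/toUpperCase to
    // the target itself, this also finds characters that map TO the
    // target's forms, such as U+212A KELVIN SIGN for 'k'.
    static Set<Integer> caseVariants(int target) {
        int lower = Character.toLowerCase(target);
        int upper = Character.toUpperCase(target);
        Set<Integer> variants = new TreeSet<>();
        for (int cp = Character.MIN_CODE_POINT; cp <= Character.MAX_CODE_POINT; cp++) {
            if (Character.toLowerCase(cp) == lower || Character.toUpperCase(cp) == upper) {
                variants.add(cp);
            }
        }
        return variants;
    }

    public static void main(String[] args) {
        // Forward mappings of 'k' only reach 'k' and 'K'.
        System.out.printf("U+%04X U+%04X%n",
                Character.toLowerCase((int) 'k'), Character.toUpperCase((int) 'k'));

        // The scan additionally reports the Kelvin sign.
        for (int cp : caseVariants('k')) {
            System.out.printf("U+%04X%n", cp);
        }
    }
}
```

        The scan is slow but makes the non-invertibility visible: the Kelvin sign lower-cases to 'k', yet neither case mapping of 'k' leads back to it.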

        Khaled Noaman added a comment -

        You're right. There's a problem with some characters not being matched properly. I've checked in a fix for that.


          People

          • Assignee:
            Khaled Noaman
            Reporter:
            Radu Preotiuc-Pietro
          • Votes:
            1
            Watchers:
            1