Accumulo
  1. Accumulo
  2. ACCUMULO-209

RegExFilter does not properly regex when using multi-byte characters

    Details

    • Type: Bug Bug
    • Status: Resolved
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: 1.3.5
    • Fix Version/s: 1.4.0, 1.5.0
    • Component/s: client
    • Labels:
      None

      Description

      The current RegExFilter class uses a ByteArrayBackedCharSequence to set the data to match against. The ByteArrayBackedCharSequence contains a line of code that prevents the matcher from properly matching multi-byte characters.

      Line 49 of ByteArrayBackedCharSequence.java is:
      return (char) (0xff & data[offset + index]);

      This incorrectly casts a single byte from the byte array to a char, which is 2 bytes in Java. This prevents the RegExFilter from properly performing Regular Expressions on multi-byte character encoded values.

      A patch for the RegExFilter.java file has been created and will be submitted.

      1. accumulo-209.patch
        3 kB
        Jim Klucar
      2. accumulo-209-RegExFilter.patch
        4 kB
        Jim Klucar
      3. accumulo-209-RegExFilterTest.patch
        1 kB
        Jim Klucar

        Activity

        Hide
        Jim Klucar added a comment -

        Patch is for the current Accumulo trunk.

        Show
        Jim Klucar added a comment - Patch is for the current Accumulo trunk.
        Hide
        Billie Rinaldi added a comment -

        Jim, thanks for finding and fixing this bug. Could you write a test that reveals the issue and is fixed by your patch? Also, please Apache-license your patches.

        Show
        Billie Rinaldi added a comment - Jim, thanks for finding and fixing this bug. Could you write a test that reveals the issue and is fixed by your patch? Also, please Apache-license your patches.
        Hide
        Jim Klucar added a comment -

        You're welcome. Is it ok just to patch the existing RegExFilterTest.java or should I create a new test class?

        Show
        Jim Klucar added a comment - You're welcome. Is it ok just to patch the existing RegExFilterTest.java or should I create a new test class?
        Hide
        Billie Rinaldi added a comment -

        Feel free to add things to the existing test class.

        Show
        Billie Rinaldi added a comment - Feel free to add things to the existing test class.
        Hide
        Jim Klucar added a comment -

        Updated patch generated from base trunk directory. Also, added static setEncoding method

        Show
        Jim Klucar added a comment - Updated patch generated from base trunk directory. Also, added static setEncoding method
        Hide
        Jim Klucar added a comment -

        Patches RegExFilter class to properly handle multi-byte character encodings. Test case updated to handle regression.

        Show
        Jim Klucar added a comment - Patches RegExFilter class to properly handle multi-byte character encodings. Test case updated to handle regression.
        Hide
        Keith Turner added a comment -

        I noticed a problem with the patch. I needs to do the following :

             matcher.reset(new String(bs.getBackingArray(), bs.offset(), bs.length(), encoding));
        

        instead of

             matcher.reset(new String(bs.getBackingArray(), encoding));
        

        The byte sequence may use a subset of the backing array.

        Show
        Keith Turner added a comment - I noticed a problem with the patch. I needs to do the following : matcher.reset(new String(bs.getBackingArray(), bs.offset(), bs.length(), encoding)); instead of matcher.reset(new String(bs.getBackingArray(), encoding)); The byte sequence may use a subset of the backing array.
        Hide
        Billie Rinaldi added a comment -

        ByteArrayBackedCharSequence doesn't appear to be used by anything else. Keith pointed out that the new code copies the data each time, which the old code did not. We may want to take a look at the performance difference.

        Show
        Billie Rinaldi added a comment - ByteArrayBackedCharSequence doesn't appear to be used by anything else. Keith pointed out that the new code copies the data each time, which the old code did not. We may want to take a look at the performance difference.

          People

          • Assignee:
            Billie Rinaldi
            Reporter:
            Jim Klucar
          • Votes:
            2 Vote for this issue
            Watchers:
            2 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Time Tracking

              Estimated:
              Original Estimate - 1h
              1h
              Remaining:
              Remaining Estimate - 1h
              1h
              Logged:
              Time Spent - Not Specified
              Not Specified

                Development