Accumulo
  1. Accumulo
  2. ACCUMULO-209

RegExFilter does not properly regex when using multi-byte characters

    Details

    • Type: Bug Bug
    • Status: Resolved
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: 1.3.5
    • Fix Version/s: 1.4.0, 1.5.0
    • Component/s: client
    • Labels:
      None

      Description

      The current RegExFilter class uses a ByteArrayBackedCharSequence to set the data to match against. The ByteArrayBackedCharSequence contains a line of code that prevents the matcher from properly matching multi-byte characters.

      Line 49 of ByteArrayBackedCharSequence.java is:
      return (char) (0xff & data[offset + index]);

      This incorrectly casts a single byte from the byte array to a char, which is 2 bytes in Java. This prevents the RegExFilter from properly performing Regular Expressions on multi-byte character encoded values.

      A patch for the RegExFilter.java file has been created and will be submitted.

      1. accumulo-209-RegExFilterTest.patch
        1 kB
        Jim Klucar
      2. accumulo-209-RegExFilter.patch
        4 kB
        Jim Klucar
      3. accumulo-209.patch
        3 kB
        Jim Klucar

        Activity

        Jim Klucar created issue -
        Hide
        Jim Klucar added a comment -

        Patch is for the current Accumulo trunk.

        Show
        Jim Klucar added a comment - Patch is for the current Accumulo trunk.
        Jim Klucar made changes -
        Field Original Value New Value
        Status Open [ 1 ] Patch Available [ 10002 ]
        Affects Version/s 1.5.0 [ 12318645 ]
        Affects Version/s 1.3.5 [ 12318442 ]
        Fix Version/s 1.5.0 [ 12318645 ]
        Jim Klucar made changes -
        Status Patch Available [ 10002 ] Open [ 1 ]
        Jim Klucar made changes -
        Attachment accumulo-209.patch [ 12506514 ]
        Billie Rinaldi made changes -
        Assignee Billie Rinaldi [ billie.rinaldi ] Jim Klucar [ jklucar ]
        Hide
        Billie Rinaldi added a comment -

        Jim, thanks for finding and fixing this bug. Could you write a test that reveals the issue and is fixed by your patch? Also, please Apache-license your patches.

        Show
        Billie Rinaldi added a comment - Jim, thanks for finding and fixing this bug. Could you write a test that reveals the issue and is fixed by your patch? Also, please Apache-license your patches.
        Billie Rinaldi made changes -
        Fix Version/s 1.4.0 [ 12318441 ]
        Affects Version/s 1.3.5 [ 12318442 ]
        Affects Version/s 1.5.0 [ 12318645 ]
        Hide
        Jim Klucar added a comment -

        You're welcome. Is it ok just to patch the existing RegExFilterTest.java or should I create a new test class?

        Show
        Jim Klucar added a comment - You're welcome. Is it ok just to patch the existing RegExFilterTest.java or should I create a new test class?
        Hide
        Billie Rinaldi added a comment -

        Feel free to add things to the existing test class.

        Show
        Billie Rinaldi added a comment - Feel free to add things to the existing test class.
        Hide
        Jim Klucar added a comment -

        Updated patch generated from base trunk directory. Also, added static setEncoding method

        Show
        Jim Klucar added a comment - Updated patch generated from base trunk directory. Also, added static setEncoding method
        Jim Klucar made changes -
        Status Open [ 1 ] Patch Available [ 10002 ]
        Assignee Jim Klucar [ jklucar ] Billie Rinaldi [ billie.rinaldi ]
        Hide
        Jim Klucar added a comment -

        Patches RegExFilter class to properly handle multi-byte character encodings. Test case updated to handle regression.

        Show
        Jim Klucar added a comment - Patches RegExFilter class to properly handle multi-byte character encodings. Test case updated to handle regression.
        Jim Klucar made changes -
        Attachment accumulo-209-RegExFilter.patch [ 12506666 ]
        Attachment accumulo-209-RegExFilterTest.patch [ 12506667 ]
        Hide
        Keith Turner added a comment -

        I noticed a problem with the patch. I needs to do the following :

             matcher.reset(new String(bs.getBackingArray(), bs.offset(), bs.length(), encoding));
        

        instead of

             matcher.reset(new String(bs.getBackingArray(), encoding));
        

        The byte sequence may use a subset of the backing array.

        Show
        Keith Turner added a comment - I noticed a problem with the patch. I needs to do the following : matcher.reset(new String(bs.getBackingArray(), bs.offset(), bs.length(), encoding)); instead of matcher.reset(new String(bs.getBackingArray(), encoding)); The byte sequence may use a subset of the backing array.
        Billie Rinaldi made changes -
        Status Patch Available [ 10002 ] Resolved [ 5 ]
        Resolution Fixed [ 1 ]
        Hide
        Billie Rinaldi added a comment -

        ByteArrayBackedCharSequence doesn't appear to be used by anything else. Keith pointed out that the new code copies the data each time, which the old code did not. We may want to take a look at the performance difference.

        Show
        Billie Rinaldi added a comment - ByteArrayBackedCharSequence doesn't appear to be used by anything else. Keith pointed out that the new code copies the data each time, which the old code did not. We may want to take a look at the performance difference.
        Gavin made changes -
        Workflow no-reopen-closed, patch-avail [ 12644963 ] patch-available, re-open possible [ 12671671 ]

          People

          • Assignee:
            Billie Rinaldi
            Reporter:
            Jim Klucar
          • Votes:
            2 Vote for this issue
            Watchers:
            2 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Time Tracking

              Estimated:
              Original Estimate - 1h
              1h
              Remaining:
              Remaining Estimate - 1h
              1h
              Logged:
              Time Spent - Not Specified
              Not Specified

                Development