HBase / HBASE-6618

Implement FuzzyRowFilter with ranges support

    Details

    • Type: New Feature
    • Status: Open
    • Priority: Minor
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: 2.0.0, 1.3.0
    • Component/s: Filters
    • Labels:
      None

      Description

      Apart from the current ability to specify a fuzzy row filter, e.g. for the <userId_actionId> format as ????_0004 (where 0004 is the actionId), it would be great to also be able to specify a "fuzzy range", e.g. ????_0004, ..., ????_0099.

      See initial discussion here: http://search-hadoop.com/m/WVLJdX0Z65

      Note: currently it is possible to provide multiple fuzzy row rules to the existing FuzzyRowFilter, but when the range is big (contains thousands of values) this is not efficient.
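
      As a minimal sketch of that workaround (not code from this issue): with the FuzzyRowFilter already in HBase, which takes a list of (row key, fuzzy info) pairs where 0 marks a fixed byte and 1 a non-fixed byte, the range has to be expanded into one rule per value. The class and method names below are made up for illustration only.

        import java.util.ArrayList;
        import java.util.List;
        import org.apache.hadoop.hbase.client.Scan;
        import org.apache.hadoop.hbase.filter.FuzzyRowFilter;
        import org.apache.hadoop.hbase.util.Bytes;
        import org.apache.hadoop.hbase.util.Pair;

        public class FuzzyRangeWorkaround {
          // Expands ????_0004 ... ????_0099 into one fuzzy rule per actionId.
          public static Scan scanActionIdRange(int from, int to) {
            List<Pair<byte[], byte[]>> fuzzyKeys = new ArrayList<Pair<byte[], byte[]>>();
            // fuzzy info: 0 = byte at this position must match, 1 = byte can be anything
            byte[] fuzzyInfo = new byte[] {1, 1, 1, 1, 0, 0, 0, 0, 0};
            for (int actionId = from; actionId <= to; actionId++) {
              byte[] rowKeyPattern = Bytes.toBytes(String.format("????_%04d", actionId));
              fuzzyKeys.add(new Pair<byte[], byte[]>(rowKeyPattern, fuzzyInfo));
            }
            Scan scan = new Scan();
            scan.setFilter(new FuzzyRowFilter(fuzzyKeys)); // thousands of rules for a big range
            return scan;
          }
        }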

      The filter should perform efficient fast-forwarding during the scan (this is what distinguishes it from a regex row filter).

      While such functionality may seem like a better fit for a custom filter (i.e. not included in the standard filter set), the filter looks very reusable. We can judge based on the implementation that will hopefully be added.

      1. HBASE-6618_5.patch
        119 kB
        Alex Baranau
      2. HBASE-6618_4.patch
        98 kB
        Alex Baranau
      3. HBASE-6618_3.path
        88 kB
        Alex Baranau
      4. HBASE-6618_2.path
        88 kB
        Alex Baranau
      5. HBASE-6618.patch
        88 kB
        Alex Baranau
      6. HBASE-6618-algo-desc-bits.png
        146 kB
        Alex Baranau
      7. HBASE-6618-algo.patch
        8 kB
        Alex Baranau

        Issue Links

          Activity

          Alex Baranau added a comment -

          Totally agree on overloading the ctor. Will add that. Also will see if a Builder makes sense: it'd help with these lists of pairs/triples.

          Thank you for looking at the patch, Igor Kuzmitshov!

          Igor Kuzmitshov added a comment -

          Alex Baranau, you are right about keeping the mask separate, somehow I forgot that ? can be a “normal byte”, sorry.

          I have just checked other Filters, it seems that all are quite low-level and use byte arrays as constructor parameters. It makes sense to use byte arrays as parameters to be consistent, but adding a builder could be nice as well.

          For me, the biggest “inconvenience” (especially when using HBase shell) of constructing a FuzzyRowFilter is not in byte arrays themselves, but in Lists of Pairs (or Triples) of byte arrays. I would add a simpler constructor for one rule (I guess one rule would be enough quite often) and a separate method to add rules:

          FuzzyRowFilter(byte[] fuzzyInfo, byte[] lowerBytes, byte[] upperBytes)
          void addRule(byte[] fuzzyInfo, byte[] lowerBytes, byte[] upperBytes)
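
          A hypothetical usage sketch of that suggestion (neither this constructor nor addRule() exists in FuzzyRowFilter today; the byte values are placeholders only, with zeroes at non-fixed positions as the patch javadoc asks for):

          // Hypothetical API per the suggestion above; fuzzy info here uses 0 for a
          // wildcard position and 1 for a range-checked position (placeholder convention).
          byte[] fuzzyInfo  = new byte[] {0, 0, 0, 0, 1, 1, 1, 1, 1};
          byte[] lowerBytes = Bytes.toBytesBinary("\\x00\\x00\\x00\\x00_0004");
          byte[] upperBytes = Bytes.toBytesBinary("\\x00\\x00\\x00\\x00_0099");
          FuzzyRowFilter filter = new FuzzyRowFilter(fuzzyInfo, lowerBytes, upperBytes);
          // further rules could be appended without building a List of Triples:
          filter.addRule(fuzzyInfo, Bytes.toBytesBinary("\\x00\\x00\\x00\\x00_0500"),
                                    Bytes.toBytesBinary("\\x00\\x00\\x00\\x00_0599"));
          scan.setFilter(filter); // assuming an existing Scan named scan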
          
          Alex Baranau added a comment -

          Got it, thanks.

          We could use '\' before '?' to define the normal byte '?'

          and then \ before \ if we need \. And so on.

          I mean we can do that. Having Strings with special chars where the impl expects arbitrary bytes as normal input is at times not trivial. And if we really want that, we would make it part of the API in some form of standard. And then we could have it everywhere, e.g. in Puts, etc. I am not sure we want to create a specific format for one filter..

          What are your thoughts on the builder? Seems like it can help avoid all those special chars and still keep things very human-friendly. We could also allow Strings with the \x notation as I mentioned. But the difference is that there is no pain with special chars and an easy, guiding API for users...

          chunhui shen added a comment -

          not sure I got the question, sorry: which checks?

          + * <p>
          + *   NOTE that currently no checks are performed to ensure that length of ranges lower bytes and
          + *   ranges upper bytes match mask length. Filter may work incorrectly or fail (with runtime
          + *   exceptions) if this is broken.
          + * </p>
          + *
          + * <p>
          + *   NOTE that currently no checks are performed to ensure that ranges are defined correctly (i.e.
          + *   lower value of each range is not greater than upper value). Filter may work incorrectly or fail
          + *   (with runtime exceptions) if this is broken.
          + * </p>
          + *
          + * <p>
          + *   NOTE that currently no checks are performed to ensure that at non-fixed positions in
          + *   ranges lower bytes and ranges upper bytes zeroes are set, but implementation may rely on this.
          + * </p>
          

          I mean the above checks

          chunhui shen added a comment - - edited

          you have to somehow define how to put "?" if I want it as "normal byte".

          We could use '\' before '?' to define the normal byte '?'

          In my view, the user could construct FuzzyRowFilter from a readable String directly,
          e.g.

          ???11??AA\x00??\?

          Use Bytes.toBytesBinary to convert the string to bytes, then parse the bytes: if the byte is '?', mark it as a non-fixed byte; if the byte is '\', skip it and take the next byte literally; and so on.

          Of course, if the user wants '\x00' to stand for 4 literal bytes, the above seems wrong.
          For this case, we should also support constructing FuzzyRowFilter from a readable byte array.
          For example,

          ???11??AA\x00??\? 

          =>
          byte[0]='?'
          byte[1]='?'
          byte[2]='?'
          byte[3]='1'
          byte[4]='1'
          byte[5]='?'
          byte[6]='?'
          byte[7]='A'
          byte[8]='A'
          byte[9]=0
          byte[10]='?'
          byte[11]='?'
          byte[12]='\'
          byte[13]='?'

          Correct me if something is wrong.
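
          A minimal sketch of that parsing (a hypothetical helper, not from any patch), producing a (row key, fuzzy info) pair in the existing FuzzyRowFilter convention of 0 = fixed byte, 1 = non-fixed byte:

          // uses java.io.ByteArrayOutputStream and org.apache.hadoop.hbase.util.{Bytes, Pair}
          // '?'  -> non-fixed byte; '\?' -> literal '?' as a fixed byte; any other byte -> fixed
          public static Pair<byte[], byte[]> parseReadablePattern(String pattern) {
            byte[] raw = Bytes.toBytesBinary(pattern); // resolves \xNN sequences first
            java.io.ByteArrayOutputStream key = new java.io.ByteArrayOutputStream();
            java.io.ByteArrayOutputStream mask = new java.io.ByteArrayOutputStream();
            for (int i = 0; i < raw.length; i++) {
              byte b = raw[i];
              if (b == '\\' && i + 1 < raw.length) {
                key.write(raw[++i]); // escaped byte is taken literally as a fixed byte
                mask.write(0);
              } else if (b == '?') {
                key.write(0);        // value at a non-fixed position does not matter
                mask.write(1);
              } else {
                key.write(b);
                mask.write(0);
              }
            }
            return new Pair<byte[], byte[]>(key.toByteArray(), mask.toByteArray());
          }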

          Alex Baranau added a comment -

          With human-readable strings and "?" inside, you have to somehow define how to put "?" if I want it as a "normal byte"...

          Alex Baranau added a comment -

          Right. I mean it still won't be that human-friendly... I thought more about something like this:

          new FuzzyRowFilter.Builder()
            .any(<length>) // meaning "????" for 4
            .range(<range_start_bytes>, <range_end_bytes>)  // builder will check that length of those is the same
            .any(<length>)
            .fixed(<couple_fixed_bytes>)
            .build();
          

          We may also overload it to allow strings if that makes sense, so that e.g. "???(11-88)??AAA" could be built with:

          new FuzzyRowFilter.Builder()
            .any(3)
            .range("11", "88")
            .any(2)
            .fixed("AAA")
            .build();
          

          thoughts?

          Igor Kuzmitshov added a comment -

          Using (human-readable) strings instead of byte arrays seems possible when non-printable bytes are given in \x00 format (widely used in HBase) and conversions are done with toBytesBinary() and toStringBinary() of org.apache.hadoop.hbase.util.Bytes. Example: from "??a\x00" to "??c\x1F".
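
          For reference, a small sketch of that round trip with the existing org.apache.hadoop.hbase.util.Bytes utilities (pattern strings taken from the example above):

          // "\x00" / "\x1F" become single bytes; printable characters pass through unchanged
          byte[] lower = Bytes.toBytesBinary("??a\\x00");  // => {'?', '?', 'a', 0x00}
          byte[] upper = Bytes.toBytesBinary("??c\\x1F");  // => {'?', '?', 'c', 0x1F}
          System.out.println(Bytes.toStringBinary(lower)); // prints: ??a\x00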

          Alex Baranau added a comment -

          Making it more usable with a better API for constructing the filter makes a lot of sense to me. It is not as simple as just defining a human-readable string, as the bytes might not be human readable. I'm thinking about builder-like construction of the filter, which should be helpful, as an addition to the "raw definition". Will share thoughts/changes very soon if I figure out a good way to do it.

          There are many notions to keep in mind when using FuzzyRowFilter; could we do these checks inside FuzzyRowFilter?

          not sure I got the question, sorry: which checks?

          chunhui shen added a comment -

          FuzzyRowFilter seems usable only by advanced users.

          Should we support creating FuzzyRowFilter from a human-readable String, e.g. ??0004?
          There are many notions to keep in mind when using FuzzyRowFilter; could we do these checks inside FuzzyRowFilter?

          Hadoop QA added a comment -

          +1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12638881/HBASE-6618_5.patch
          against trunk revision .
          ATTACHMENT ID: 12638881

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 9 new or modified tests.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          +1 lineLengths. The patch does not introduce lines longer than 100

          +1 site. The mvn site goal succeeds with this patch.

          +1 core tests. The patch passed unit tests in .

          Test results: https://builds.apache.org/job/PreCommit-HBASE-Build/9206//testReport/
          Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/9206//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-hadoop2-compat.html
          Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/9206//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-prefix-tree.html
          Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/9206//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-client.html
          Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/9206//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-common.html
          Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/9206//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-protocol.html
          Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/9206//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-server.html
          Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/9206//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-examples.html
          Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/9206//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-thrift.html
          Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/9206//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-hadoop-compat.html
          Console output: https://builds.apache.org/job/PreCommit-HBASE-Build/9206//console

          This message is automatically generated.

          Ted Yu added a comment -

          Thanks for the quick turnaround.

          Andrew Purtell: do you want this in 0.98 ?

          Alex Baranau added a comment -

          Igor Kuzmitshov, heh - in satisfies() I found that it was incorrect, i.e. as you described. Sorry I overlooked that. Fixed in the latest patch.

          Ted Yu: added more tests and also a test for compatibility with a client compiled against the old version of the code. Though I'd really ask you to check whether that was done correctly: I don't have a lot of experience with protobuf.

          Thank you guys!

          Alex Baranau added a comment -

          addressed review comments

          Hadoop QA added a comment -

          +1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12638828/HBASE-6618_4.patch
          against trunk revision .
          ATTACHMENT ID: 12638828

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 9 new or modified tests.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          +1 lineLengths. The patch does not introduce lines longer than 100

          +1 site. The mvn site goal succeeds with this patch.

          +1 core tests. The patch passed unit tests in .

          Test results: https://builds.apache.org/job/PreCommit-HBASE-Build/9204//testReport/
          Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/9204//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-protocol.html
          Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/9204//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-thrift.html
          Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/9204//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-client.html
          Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/9204//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-hadoop2-compat.html
          Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/9204//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-examples.html
          Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/9204//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-prefix-tree.html
          Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/9204//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-common.html
          Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/9204//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-server.html
          Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/9204//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-hadoop-compat.html
          Console output: https://builds.apache.org/job/PreCommit-HBASE-Build/9204//console

          This message is automatically generated.

          Ted Yu added a comment -

          If a client compiled with FuzzyRowFilter before this change uses FuzzyRowFilter in a Scan, would the server side be able to handle it?

          Please add more tests for the new rules. This would make code more robust and help detect regression.

          Alex Baranau added a comment -

          I mean than the one available in HBase currently

          Alex Baranau added a comment -

          Updated patch, also uploaded to the review board at https://reviews.apache.org/r/8786. Very small change to fit the latest trunk. Ted Yu, if you have time by chance, I'd very much appreciate it if you could review. This version is much better, much more flexible than the current one. Thank you a lot in advance.

          Alex Baranau added a comment -

          Updated patch to fit latest trunk

          Alex Baranau added a comment -

          Igor Kuzmitshov

          I thought that the value in the fixed part is checked as whole, but the code actually checks its bytes in isolation, so the rule is actually ????0(0 - 9)(0 - 9)(1 - 9)

          Not true: aa68 will satisfy the rule ??(53 - 97). I added a test specifically for that:

              // Range
              Assert.assertEquals(FuzzyRowFilter.SatisfiesCode.YES,
                                  FuzzyRowFilter.satisfies(
                                    new byte[]{1, 1, 6, 8},
                                    new Triple<byte[], byte[], byte[]>(
                                      new byte[]{0, 0, 1, 1}, // mask
                                      new byte[]{1, 1, 5, 6}, // upper bytes
                                      new byte[]{1, 1, 9, 7}))); // lower bytes
          
          Igor Kuzmitshov added a comment -

          Please note that in the version proposed by me (aa68 should satisfy rule ??(53 - 97)) it's not possible to have adjacent ranges in the rule: the high-level ??(10-19)(00-30) and ??(1000-1930) will be written as the same range (key start, key end, mask): ??1000, ??1930, 110000. This can be solved by using different values in the mask (it would be more convenient to use 0 for non-fixed bytes, 1 for range 1, 2 for range 2 and so on).
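
          For illustration only (not code from any patch), the mask values suggested above would keep the two readings apart even though the key bounds are identical:

          byte[] lower        = Bytes.toBytes("??1000");
          byte[] upper        = Bytes.toBytes("??1930");
          byte[] maskOneRange = new byte[] {0, 0, 1, 1, 1, 1}; // ??(1000-1930): one 4-byte range
          byte[] maskTwoRange = new byte[] {0, 0, 1, 1, 2, 2}; // ??(10-19)(00-30): two adjacent 2-byte ranges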

          Igor Kuzmitshov added a comment -

          Looking at the description above that rule ????(0001 - 0999) means <any 4 bytes><any 4 bytes value between "0001" and "0999">, I thought that the value in the fixed part is checked as whole, but the code actually checks its bytes in isolation, so the rule is actually ????0(0 - 9)(0 - 9)(1 - 9).

          It's fine for ranges like this, but let's take another: ??(53 - 97). I would expect aa68 to satisfy the rule, but in the proposed implementation it doesn't (because bytes are checked in isolation and 8 is outside the range [3, 7]). Could you clarify if this is the intended behaviour?

          If yes, i.e. aa68 should not satisfy rule ??(53 - 97):
          It would be nice to make it more clear in the description that all bytes are checked in isolation and there are actually no n-bytes values. In this case, there is a bug: for rule ??(50 - 97) and value MM58 (where M is max byte \xFF), satisfies() returns SatisfiesCode.NO_NEXT because nextRowKeyCandidateExists is only updated for non-fixed positions. It should return NEXT_EXISTS, because MM60 should be the next key.

          If no, i.e. aa68 should satisfy rule ??(53 - 97):
          In this case, satisfy() should be fixed. I made a patch with the fix and can add it if needed. It also has a small optimisation when there is no need to check less significant bytes. For example: for range [120, 500] and key 345, it will compare the first byte (3) only, as it's clear that the whole value is in the range.

          In any case, tests might include testing satisfy() with ranges (the current patch only adds tests for getNextForFuzzyRule() with ranges).
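
          To make the two interpretations concrete, an illustrative check (not code from the patch) of the value aa68 against the rule ??(53 - 97), assuming plain ASCII bytes so signed byte comparison is fine:

          byte[] row   = Bytes.toBytes("aa68");
          byte[] lower = Bytes.toBytes("??53"); // bytes at '?' positions are ignored below
          byte[] upper = Bytes.toBytes("??97");
          byte[] mask  = new byte[] {0, 0, 1, 1}; // 1 marks the range-checked positions

          // Per-byte semantics: each range position is checked independently.
          boolean perByte = true;
          for (int i = 0; i < row.length; i++) {
            if (mask[i] == 1 && (row[i] < lower[i] || row[i] > upper[i])) perByte = false;
          }
          // '6' is within ['5','9'] but '8' is not within ['3','7'] => perByte == false

          // Whole-value semantics: the range positions are compared as one lexicographic value.
          boolean wholeValue = Bytes.compareTo(row, 2, 2, lower, 2, 2) >= 0
                            && Bytes.compareTo(row, 2, 2, upper, 2, 2) <= 0; // "68" in ["53","97"] => true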

          Ted Yu added a comment -

          Alex:
          I should have time to review.
          Please update the patch.

          Thanks

          Alex Baranau added a comment -

          Yeah, looks like nobody looked at the patch, even though I know others use it. Weird, and I don't know how to push anyone to do that.

          Not sure if the patch fits the latest version. If there's still interest from any committer (I hope so) to take a look and proceed with the issue, I will look at it and make sure it is good for the latest version.

          Andrew Purtell added a comment -

          Stale issue

          James Taylor added a comment -

          Yes, that's a typo. Thanks - you've got good eyes!

          Ted Yu added a comment - - edited

          skip hint as ['foobar'][4000] (i.e.

          'foobar' + new byte[] {0} + new byte[] {1,0,0,0}
          

          ).
          Is there a typo above ?
          Should it be:

          'foobar' + new byte[] {0} + new byte[] {4,0,0,0}
          
          James Taylor added a comment -

          Would it be possible to generalize this a bit further to handle variable-length key parts, assuming you know the terminator (both Phoenix and Orderly use a 0 byte terminator)? With the work you've already done here to support ranges, you could support a good set of skip-scanning scenarios for multi-part row keys.

          Take for example a fuzzy row filter expressed for a two part row key of VARCHAR + byte[4] like this:

          'foo%' (this would be for the VARCHAR key part - 'foo' followed by zero or more characters)
          [4000-6000) (this would be for the INT key part - 4000 inclusive to 6000 exclusive)

          In this case (as you've already pointed out), you can use the first row key as your guide. Let's say the first row key is ['foobar'][1000]. You could form a skip hint as ['foobar'][4000] (i.e. 'foobar' + new byte[] {0} + new byte[] {1,0,0,0}).
          Then you'd let all values pass until you got to ['foobar'][6000], in which case you'd form your next skip hint.
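
          A small sketch of composing such a skip hint with the existing Bytes helpers, under the stated assumption of a 0-byte terminator after the VARCHAR part (the INT encoding here is plain big-endian; Phoenix and Orderly use their own order-preserving encodings):

          byte[] varcharPart = Bytes.toBytes("foobar");
          byte[] terminator  = new byte[] {0};
          byte[] intPart     = Bytes.toBytes(4000); // 4-byte big-endian int
          byte[] skipHint    = Bytes.add(varcharPart, terminator, intPart); // 'foobar' + \x00 + [4000]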

          Alex Baranau added a comment -

          -1 lineLengths. The patch introduces lines longer than 100

          I guess that's because of generated code (from *.proto). That should be OK, right?

          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12563115/HBASE-6618_3.path
          against trunk revision .

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 6 new or modified tests.

          +1 hadoop2.0. The patch compiles against the hadoop 2.0 profile.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          -1 lineLengths. The patch introduces lines longer than 100

          -1 core tests. The patch failed these unit tests:
          org.apache.hadoop.hbase.regionserver.TestMetricsRegionServer
          org.apache.hadoop.hbase.ipc.TestRpcMetrics

          Test results: https://builds.apache.org/job/PreCommit-HBASE-Build/3826//testReport/
          Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/3826//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-protocol.html
          Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/3826//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-hadoop2-compat.html
          Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/3826//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-examples.html
          Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/3826//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-hadoop1-compat.html
          Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/3826//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-common.html
          Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/3826//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-server.html
          Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/3826//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-hadoop-compat.html
          Console output: https://builds.apache.org/job/PreCommit-HBASE-Build/3826//console

          This message is automatically generated.

          Alex Baranau added a comment -

          Fixed line length in patch (one was >100).

          Not sure what's going on with those tests: they pass for me locally.

          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12562822/HBASE-6618_2.path
          against trunk revision .

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 6 new or modified tests.

          +1 hadoop2.0. The patch compiles against the hadoop 2.0 profile.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          -1 lineLengths. The patch introduces lines longer than 100

          -1 core tests. The patch failed these unit tests:
          org.apache.hadoop.hbase.regionserver.TestMetricsRegionServer
          org.apache.hadoop.hbase.ipc.TestRpcMetrics

          -1 core zombie tests. There are 3 zombie test(s):

          Test results: https://builds.apache.org/job/PreCommit-HBASE-Build/3789//testReport/
          Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/3789//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-protocol.html
          Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/3789//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-hadoop2-compat.html
          Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/3789//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-examples.html
          Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/3789//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-hadoop1-compat.html
          Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/3789//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-common.html
          Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/3789//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-server.html
          Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/3789//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-hadoop-compat.html
          Console output: https://builds.apache.org/job/PreCommit-HBASE-Build/3789//console

          This message is automatically generated.

          Alex Baranau added a comment -

          Fixed a typo; now the filter tests should pass. Not sure why the metrics tests failed. Will look into it if it occurs again.

          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12562810/HBASE-6618.patch
          against trunk revision .

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 6 new or modified tests.

          +1 hadoop2.0. The patch compiles against the hadoop 2.0 profile.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          -1 lineLengths. The patch introduces lines longer than 100

          -1 core tests. The patch failed these unit tests:
          org.apache.hadoop.hbase.regionserver.TestMetricsRegionServer
          org.apache.hadoop.hbase.filter.TestFuzzyRowFilter
          org.apache.hadoop.hbase.ipc.TestRpcMetrics

          Test results: https://builds.apache.org/job/PreCommit-HBASE-Build/3788//testReport/
          Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/3788//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-hadoop2-compat.html
          Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/3788//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-common.html
          Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/3788//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-protocol.html
          Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/3788//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-server.html
          Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/3788//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-hadoop1-compat.html
          Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/3788//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-examples.html
          Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/3788//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-hadoop-compat.html
          Console output: https://builds.apache.org/job/PreCommit-HBASE-Build/3788//console

          This message is automatically generated.

          Anil Gupta added a comment -

          Hi Alex,

          Actually, we decided not to use full-table scans for this type of query. Hence, I could not devote time to this. We are trying out Secondary Index in HBase.
          Sorry for the late update.

          Wish you a Happy New Year!
          ~Anil Gupta
          Software Engineer II, Intuit, Inc

          Alex Baranau added a comment -

          Created https://reviews.apache.org/r/8786. I think I will add more unit-tests. Comments are very welcome!

          Alex Baranau added a comment -

          Looks like Anil didn't find time for that in the end.

          But I believe this functionality is very, very useful. Will work on it then.

          Attached a patch with the implemented filter functionality. I have doubts about the filter API (i.e. defining fuzzy rules with Triple<byte[], byte[], byte[]>), any suggestions are very welcome!

          Alex Baranau added a comment -

          Weird. I can open it. Anyhow, sent it to your email.

          Anil Gupta added a comment -

          Hi Alex,

          I am still unable to access the png file for the algorithm. Is there some problem with the JIRA system? Or can you re-upload the image?

          Thanks,
          Anil Gupta
          Software Engineer II, Intuit, Inc

          Anil Gupta added a comment -

          Hi,

          Last night, in the middle of my reply, JIRA went into maintenance mode. Now it seems to be out of maintenance mode, but I cannot access the png file attached to this JIRA. Is there any issue going on with JIRA?

          Thanks,
          Anil Gupta
          Software Engineer II, Intuit, Inc

          Ted Yu added a comment -

          Enhancing existing class is fine.

          Alex Baranau added a comment -

          Ah, sorry, I haven't said anything about that. For toInc - we may not change it at every step, so if there's a missing arrow, that means nothing should be changed.

          Thanks for checking it out!

          One thing that I'm not 100% sure about: is it better to adjust the current FuzzyRowFilter and add this functionality to it, or to add a new filter? I'm leaning towards adjusting FuzzyRowFilter, as this new feature fits naturally in it. Thoughts?

          Ted Yu added a comment -

          Thanks for the update, Alex.
          I get your idea, though a few arrows seem to be missing (e.g. CCF is ?) in the diagram for toInc.

          Alex Baranau added a comment -

          Anil,

          Now that I have thought about it, I realize that finding the row key to fast-forward to, given any number of "range groups" in the fuzzy rules, is quite easy. It can also be done in just one pass over the bytes of the given row.

          I didn't have much time to add this functionality to the filter itself, but I implemented the algorithm that finds the row key to fast-forward to (in case you are interested in looking at it). I added a static method for that with a small (not full) unit-test, and also attached a brief description of the algorithm. I hope I'm not missing anything.
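
          To illustrate the idea, here is a simplified sketch with hypothetical names (not the attached algorithm): it treats every "fixed" position as an independent per-byte [lo, hi] range rather than as part of a multi-byte group, and assumes the rule covers the whole key length. A single left-to-right pass with at most one carry step finds the next row key that can possibly match:

          import java.util.Arrays;

          // Sketch: smallest row >= 'row' whose bytes satisfy per-byte [lo, hi] ranges at
          // positions where mask == 0 (fixed) and anything at positions where mask == 1 (fuzzy).
          // Returns null when no row of the same length can match anymore.
          static byte[] nextMatchingRow(byte[] row, byte[] mask, byte[] lo, byte[] hi) {
            byte[] next = Arrays.copyOf(row, row.length);
            for (int i = 0; i < row.length; i++) {
              if (mask[i] == 1) continue;                       // fuzzy byte: anything matches
              int b = row[i] & 0xFF, min = lo[i] & 0xFF, max = hi[i] & 0xFF;
              if (b < min) {                                    // bump this byte up to its range,
                next[i] = lo[i];                                // the tail can then be minimal
                return fillMinFrom(next, mask, lo, i + 1);
              }
              if (b > max) {                                    // must carry into an earlier byte
                for (int j = i - 1; j >= 0; j--) {
                  int limit = (mask[j] == 1) ? 0xFF : (hi[j] & 0xFF);
                  if ((next[j] & 0xFF) < limit) {
                    next[j]++;
                    return fillMinFrom(next, mask, lo, j + 1);
                  }
                }
                return null;                                    // nothing larger can match
              }
            }
            return next;                                        // the row itself already matches
          }

          // Fill the tail with the smallest allowed bytes.
          static byte[] fillMinFrom(byte[] next, byte[] mask, byte[] lo, int from) {
            for (int i = from; i < next.length; i++) {
              next[i] = (mask[i] == 1) ? 0 : lo[i];
            }
            return next;
          }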

          Will implement the new feature of the filter as a next step.

          Anil Gupta added a comment -

          Hi Alex,

          I agree with your idea of a range-based fuzzy filter. However, I would like to take a phased approach in developing this:
          In your proposal, the user can provide multiple fuzzy ranges in a single scan, i.e. <any 4 bytes><any 6 bytes value between "_0001" and "0099"><any 3 bytes><any 4 bytes value between "_001" and "_099">.
          Instead of the above, IMO let's first try to make a filter for "<any 4 bytes><any 6 bytes value between "_0001" and "0099"><any 3 bytes>" or "<any 4 bytes><any 6 bytes value between "_0001" and "0099">". Once we develop this, we can enhance it to support multiple fuzzy ranges. This is just my thought on the approach; let me know your opinion.

          From this week, at work I had to shift focus from HBase to Hive and HCatalog for another POC, so I'll be squeezing time for this JIRA out of my work schedule. I'll start looking into the current implementation of FuzzyRowFilter to get an idea of the implementation.

          Thanks,
          Anil Gupta
          Software Engineer II, Intuit, Inc

          Alex Baranau added a comment -

          Sorry for the spam; for some reason I cannot edit the comment, and JIRA broke the formatting of the text pieces of my previous comment (I should have checked that first, sorry). This is how it is supposed to look:

          Just an idea: maybe we should try to improve the existing FuzzyRowFilter by allowing each fuzzy rule to be specified with:

          • fuzzy key start
          • fuzzy key end << this is currently missing in FuzzyRowFilter
          • mask

          This looks flexible enough to me. E.g. one could specify the rule ????(0001 - 0999)???(001 - 099), i.e. <any 4 bytes><any 4 bytes value between "0001" and "0999"><any 3 bytes><any 3 bytes value between "001" and "099">, with this definition:

          • ????0001???001
          • ????0999???099 << currently missing
          • 11110000111000

          In this case, any sequence of "fixed" positions is treated as one n-byte value.
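
          As a rough illustration of that matching semantics (a hypothetical helper using the ASCII '0'/'1' mask notation from the example above; the fast-forwarding part is left out):

          // '1' in the mask means "any byte"; each maximal run of '0' positions is
          // compared as one n-byte value that must fall between the corresponding
          // slices of the start and end keys.
          static boolean matchesRule(byte[] row, byte[] start, byte[] end, byte[] mask) {
            if (row.length < mask.length) return false;
            int i = 0;
            while (i < mask.length) {
              if (mask[i] == '1') { i++; continue; }              // fuzzy position
              int from = i;
              while (i < mask.length && mask[i] == '0') i++;      // one "fixed" group
              if (compare(row, start, from, i) < 0) return false; // group below its lower bound
              if (compare(row, end, from, i) > 0) return false;   // group above its upper bound
            }
            return true;
          }

          // Unsigned lexicographic comparison of row[from, to) against bound[from, to).
          static int compare(byte[] row, byte[] bound, int from, int to) {
            for (int k = from; k < to; k++) {
              int diff = (row[k] & 0xFF) - (bound[k] & 0xFF);
              if (diff != 0) return diff;
            }
            return 0;
          }

          For the rule above, start would be "????0001???001", end "????0999???099", and mask "11110000111000" (each as a byte array); the '?' bytes in start and end are never looked at, because their mask positions are '1'.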

          Alternatively, such a fuzzy rule can be specified as a list of parts, each part being one of:

          • n "fuzzy" bytes
          • start/stop key part range (of the same length)

          This might be closer to a "human-readable" definition, though the former could be easier to deal with.

          Anil, as you expressed willingness to work on this, what are your thoughts? Maybe you have something different in mind?


            People

            • Assignee:
              Alex Baranau
            • Reporter:
              Alex Baranau
            • Votes:
              3
            • Watchers:
              18
