HBase / HBASE-4364

Filters applied to columns not in the selected column list are ignored

    Details

    • Type: Bug
    • Status: Open
    • Priority: Critical
    • Resolution: Unresolved
    • Affects Version/s: 0.90.4, 0.92.0, 0.94.0
    • Fix Version/s: None
    • Component/s: Filters
    • Labels: None
    • Tags: Phoenix

      Description

      For a scan, if you select some set of columns using addColumns(), and then apply a SingleColumnValueFilter that restricts the results based on some other columns which aren't selected, then those filter conditions are ignored.
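The reported behavior can be illustrated with a small simulation. This is plain Python, not HBase code; all names here are hypothetical, and it only assumes what the description says: column selection prunes cells before the filter ever sees them, and an unselected filter column is treated as "missing" (with filterIfMissing defaulting to false).

```python
def scan(rows, selected_columns, filter_column, threshold, filter_if_missing=False):
    """Simulate a scan: prune to the selected columns first, then apply a
    SingleColumnValueFilter-style predicate (value > threshold)."""
    results = {}
    for row_key, cells in rows.items():
        # Column pruning happens first: unselected cells never reach the filter.
        visible = {c: v for c, v in cells.items() if c in selected_columns}
        if filter_column in visible:
            if visible[filter_column] > threshold:
                results[row_key] = visible
        elif not filter_if_missing:
            # The filter column was pruned away, so it counts as "missing"
            # and the row passes by default: the surprising behavior here.
            results[row_key] = visible
    return results

table = {
    "r1": {"f1:word": "hello",   "f2:word": "bonjour"},
    "r2": {"f1:word": "goodbye", "f2:word": "au revoir"},
}

# Selecting only f1:word while filtering on f2:word returns both rows,
# because f2:word never reaches the filter.
print(scan(table, {"f1:word"}, "f2:word", "b"))
```

Selecting the filter column as well (or setting filter_if_missing) restores the expected single-row result in this model.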

      Attachments

      1. HBASE-4364-failing-test-with-simplest-custom-filter.patch (5 kB, Alex Baranau)
      2. HBASE-4364.patch (34 kB, Vasu Mariyala)
      3. hbase-4364_trunk-v2.patch (7 kB, Jie Huang)
      4. hbase-4364_trunk.patch (7 kB, Jie Huang)

        Issue Links

          Activity

          Todd Lipcon added a comment -

          Example shell code to reproduce this:

          create 't1', 'f1', 'f2'
          put 't1', 'r1', 'f1:word', 'hello'
          put 't1', 'r1', 'f2:word', 'bonjour'
          put 't1', 'r2', 'f1:word', 'goodbye'
          put 't1', 'r2', 'f2:word', 'au revoir'
          
          # scan whole table, has 2 rows, each with 2 cols
          scan 't1'
          # scan selecting only one column - returns 2 distinct rows
          scan 't1', { COLUMNS => ['f1:word'] }
          # scan with a predicate of the french word > 'b', returns 1 row
          scan 't1', { FILTER => "SingleColumnValueFilter('f2', 'word', >, 'binary:b')"  }
          # scan with a predicate of the french word > 'b', selecting only the english word
          scan 't1', { COLUMNS => ['f1:word'], FILTER => "SingleColumnValueFilter('f2', 'word', >, 'binary:b')"  }
          

          The incorrect result is as follows:

          hbase(main):008:0> scan 't1'
          ROW                                COLUMN+CELL                                                                                       
           r1                                column=f1:word, timestamp=1315608975212, value=hello                                              
           r1                                column=f2:word, timestamp=1315608975238, value=bonjour                                            
           r2                                column=f1:word, timestamp=1315608975258, value=goodbye                                            
           r2                                column=f2:word, timestamp=1315608975286, value=au revoir                                          
          2 row(s) in 0.0270 seconds
          
          hbase(main):009:0> scan 't1', { COLUMNS => ['f1:word'] }
          ROW                                COLUMN+CELL                                                                                       
           r1                                column=f1:word, timestamp=1315608975212, value=hello                                              
           r2                                column=f1:word, timestamp=1315608975258, value=goodbye                                            
          2 row(s) in 0.0140 seconds
          
          hbase(main):010:0> scan 't1', { FILTER => "SingleColumnValueFilter('f2', 'word', >, 'binary:b')"  }
          ROW                                COLUMN+CELL                                                                                       
           r1                                column=f1:word, timestamp=1315608975212, value=hello                                              
           r1                                column=f2:word, timestamp=1315608975238, value=bonjour                                            
          1 row(s) in 0.0250 seconds
          
          hbase(main):011:0> scan 't1', { COLUMNS => ['f1:word'], FILTER => "SingleColumnValueFilter('f2', 'word', >, 'binary:b')"  }
          ROW                                COLUMN+CELL                                                                                       
           r1                                column=f1:word, timestamp=1315608975212, value=hello                                              
           r2                                column=f1:word, timestamp=1315608975258, value=goodbye                                            
          2 row(s) in 0.0270 seconds <---- SHOULD NOT HAVE RETURNED ANY VALUE FOR r2!
          
          Todd Lipcon added a comment - - edited

          Actually, it turns out this isn't due to column family pruning - the same behavior occurs even with just one column family:

          create 't2', 'f'
          put 't2', 'r1', 'f:e_word', 'hello'
          put 't2', 'r1', 'f:f_word', 'bonjour'
          put 't2', 'r2', 'f:e_word', 'goodbye'
          put 't2', 'r2', 'f:f_word', 'au revoir'
          scan 't2'
          # scan selecting only one column - returns 2 distinct rows
          scan 't2', { COLUMNS => ['f:e_word'] }
          # scan with a predicate of the french word > 'b', returns 1 row
          scan 't2', { FILTER => "SingleColumnValueFilter('f', 'f_word', >, 'binary:b')"  }
          # scan with a predicate of the french word > 'b', selecting only the english word
          scan 't2', { COLUMNS => ['f:e_word'], FILTER => "SingleColumnValueFilter('f', 'f_word', >, 'binary:b')"  }
          
          Todd Lipcon added a comment -

          Updated description to reflect the above: this is a general issue, not related to CFs.

          Todd Lipcon added a comment -

          Apparently this is actually known behavior according to SingleColumnValueFilter. From the JavaDoc:

          When using this filter on a {@link Scan} with specified inputs, the column to be tested should also be added as input (otherwise the filter will regard the column as missing).
          

          IMO, it's a bug, though, not a feature! Filters with requirements like this should automatically push their column requirements through to the ExplicitColumnTracker.

          Jie Huang added a comment -

          We have run into the same problem. After checking the implementation and the JavaDoc, it may not be a bug from my perspective: according to the documentation, if the scan's input does not include the column named in the filter, that column is regarded as missing. If filterIfMissing is true, the row is omitted; otherwise the row is returned unfiltered.

          In a real-world application, it is generally better to check for this case on the client side before issuing the scan. To avoid repeating that missing-column check on the client, we made a small modification to the filter while using it (see the attached patch).
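The client-side check described above amounts to also selecting the column the filter tests. In the shell (reusing table 't1' from the first comment), that would look roughly like this; the trade-off is that f2:word then appears in the results as well:

```
# Workaround sketch: select the filter's column too, so it is never "missing"
scan 't1', { COLUMNS => ['f1:word', 'f2:word'], FILTER => "SingleColumnValueFilter('f2', 'word', >, 'binary:b')" }
```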

          Show
          Jie Huang added a comment - We have met the similar problem as well. After checking the code implementation and the JavaDoc, it may not be a bug from my perspective. According to its description, if the input doesn't include that column defined in the filter, that column will be regarded as missing. If filterIfMissing is true, that row will be omitted. Otherwise, you will get the row without filtering. Generally, in real world application, we'd better to check that case on the client side before issuing the scan. In order to avoid the repetition of checking missing column on client side, we do a little modification on that file while using it (see patch file).
          Hide
          Jie Huang added a comment -

          To expose the configured column family and qualifier of a SingleColumnValueFilter, the patch adds two getter methods to the class.

          The patch also includes a sample coprocessor, which adds the configured qualifier to the tested column set by default.

          The current patch is against the 0.94.0 release.

          Jie Huang added a comment -

          Sorry, I clicked the wrong button; I meant to supply a sample file only.

          Jie Huang added a comment -

          Sample file for workaround

          Jie Huang added a comment -

          One possible solution for the feature:
          1. Add a sanityCheck() API to Filter, and call it from getScanner(), so the filter can validate the scan and its criteria up front.
          2. In SingleColumnValueFilter, if the tested column set does not include the criteria column, add it in sanityCheck(). Afterwards, remove that column in filterRow(kv), so the user still receives only the requested columns.

          I will add a corresponding test case and verify this later. Any comments are welcome; thanks.
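The two-step proposal above can be sketched in plain Python (purely illustrative, not the actual patch; all names are hypothetical): widen the tracked column set with the filter's criteria column before scanning, then strip that column from the results so the client sees only what it requested.

```python
def scan_with_fix(rows, selected_columns, filter_column, threshold):
    """Simulated scan implementing the proposed fix for HBASE-4364."""
    # Step 1 ("sanityCheck"): if the caller did not select the filter's
    # criteria column, track it internally anyway.
    added = filter_column not in selected_columns
    tracked = set(selected_columns) | {filter_column}

    results = {}
    for row_key, cells in rows.items():
        visible = {c: v for c, v in cells.items() if c in tracked}
        # The filter now always sees its column (missing values compare low).
        if visible.get(filter_column, "") > threshold:
            # Step 2 ("filterRow"): drop the internally-added column before
            # returning, so the user gets exactly the requested columns.
            if added:
                visible.pop(filter_column, None)
            results[row_key] = visible
    return results

table = {
    "r1": {"f1:word": "hello",   "f2:word": "bonjour"},
    "r2": {"f1:word": "goodbye", "f2:word": "au revoir"},
}

# Only r1 survives the f2:word > 'b' predicate, and only f1:word is returned.
print(scan_with_fix(table, {"f1:word"}, "f2:word", "b"))
```

The key design point is that the widening is invisible to the caller: the extra column exists only for the duration of the filter evaluation.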

          Jie Huang added a comment -

          Attached a patch with one possible solution for this feature. I wonder whether it is acceptable.

          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12533771/hbase-4364-0_94_0.patch
          against trunk revision .

          +1 @author. The patch does not contain any @author tags.

          -1 tests included. The patch doesn't appear to include any new or modified tests.
          Please justify why no new tests are needed for this patch.
          Also please list what manual steps were performed to verify this patch.

          -1 patch. The patch command could not apply the patch.

          Console output: https://builds.apache.org/job/PreCommit-HBASE-Build/2276//console

          This message is automatically generated.

          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12533774/hbase-4364-0_94_0.patch
          against trunk revision .

          +1 @author. The patch does not contain any @author tags.

          -1 tests included. The patch doesn't appear to include any new or modified tests.
          Please justify why no new tests are needed for this patch.
          Also please list what manual steps were performed to verify this patch.

          -1 patch. The patch command could not apply the patch.

          Console output: https://builds.apache.org/job/PreCommit-HBASE-Build/2277//console

          This message is automatically generated.

          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12533780/hbase-4364-0_94_0.patch
          against trunk revision .

          +1 @author. The patch does not contain any @author tags.

          -1 tests included. The patch doesn't appear to include any new or modified tests.
          Please justify why no new tests are needed for this patch.
          Also please list what manual steps were performed to verify this patch.

          -1 patch. The patch command could not apply the patch.

          Console output: https://builds.apache.org/job/PreCommit-HBASE-Build/2278//console

          This message is automatically generated.

          JQ Hadoop added a comment -

          Looks like a useful fix to me - the current behavior is very confusing to any average SQL user. Can anyone confirm if this is a valid patch?

          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12536340/hbase-4364_trunk.patch
          against trunk revision .

          +1 @author. The patch does not contain any @author tags.

          -1 tests included. The patch doesn't appear to include any new or modified tests.
          Please justify why no new tests are needed for this patch.
          Also please list what manual steps were performed to verify this patch.

          +1 hadoop2.0. The patch compiles against the hadoop 2.0 profile.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          -1 javac. The applied patch generated 5 javac compiler warnings (more than the trunk's current 4 warnings).

          -1 findbugs. The patch appears to introduce 10 new Findbugs (version 1.3.9) warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          -1 core tests. The patch failed these unit tests:
          org.apache.hadoop.hbase.master.TestSplitLogManager
          org.apache.hadoop.hbase.catalog.TestMetaReaderEditor

          Test results: https://builds.apache.org/job/PreCommit-HBASE-Build/2382//testReport/
          Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/2382//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-common.html
          Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/2382//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-server.html
          Console output: https://builds.apache.org/job/PreCommit-HBASE-Build/2382//console

          This message is automatically generated.

          Alex Baranau added a comment -

          Attached a patch with a simple custom-filter unit test that fails, verifying the issue.

          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12541070/HBASE-4364-failing-test-with-simplest-custom-filter.patch
          against trunk revision .

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 3 new or modified tests.

          +1 hadoop2.0. The patch compiles against the hadoop 2.0 profile.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          -1 javac. The applied patch generated 5 javac compiler warnings (more than the trunk's current 4 warnings).

          -1 findbugs. The patch appears to introduce 9 new Findbugs (version 1.3.9) warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          -1 core tests. The patch failed these unit tests:
          org.apache.hadoop.hbase.filter.TestFilterOnSelectedColumns
          org.apache.hadoop.hbase.TestDrainingServer

          Test results: https://builds.apache.org/job/PreCommit-HBASE-Build/2588//testReport/
          Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/2588//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-hadoop2-compat.html
          Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/2588//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-server.html
          Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/2588//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-hadoop1-compat.html
          Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/2588//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-common.html
          Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/2588//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-hadoop-compat.html
          Console output: https://builds.apache.org/job/PreCommit-HBASE-Build/2588//console

          This message is automatically generated.

          Lucas Bernardi added a comment -

          Hi there, we're facing a similar issue. But I guess the first question is: what is the expected behavior for the interaction between requested columns and filtered columns?
          One use case is fetching a specific set of columns while filtering rows on a different set of columns. For that use case, the filter should always be applied to all columns, regardless of which columns the client requested. This would be the SQL-like split between 'where' and 'projection'.
          But HBase is supposed to handle huge numbers of columns, so performance-wise the SQL-like behavior doesn't look like a good idea to me. I think the behavior should be: the filter is applied only to the requested columns, a more big-data-oriented behavior. Alternatively, add the possibility of specifying one set of 'projected' columns and another set of 'filtered' columns.

          In any case, 0.92.1 does something really weird: it applies the filter to the first column plus the 'projected' columns. That is closer to the big-data-oriented behavior, but with a bug, so it should be fixed first, meaning the filter is applied to the requested columns only rather than to the first column plus the requested columns. Once that works, we can think about filtering based on other columns.

          Ted Yu added a comment -

          @Lucas:
          Makes sense.
          Any chance of a patch for trunk?

          Thanks

          Lucas Bernardi added a comment -

          I'll try to look at it. Any recommendation on which classes to start with?

          Ted Yu added a comment -

          I just found that Grace provided a patch some time ago.
          But the patch doesn't compile:

          [ERROR] /Users/zhihyu/trunk-hbase/hbase-server/src/main/java/org/apache/hadoop/hbase/filter/FilterWrapper.java:[41,7] org.apache.hadoop.hbase.filter.FilterWrapper is not abstract and does not override abstract method sanityCheck(org.apache.hadoop.hbase.client.Scan) in org.apache.hadoop.hbase.filter.Filter
          

          @Lucas:
          You can use that as a starting point.

          @Grace:
          You are welcome to refine your patch.

          Jie Huang added a comment -

          Yes. I will refine my patch against the trunk.

          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12541179/hbase-4364_trunk-v2.patch
          against trunk revision .

          +1 @author. The patch does not contain any @author tags.

          -1 tests included. The patch doesn't appear to include any new or modified tests.
          Please justify why no new tests are needed for this patch.
          Also please list what manual steps were performed to verify this patch.

          +1 hadoop2.0. The patch compiles against the hadoop 2.0 profile.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          -1 javac. The applied patch generated 5 javac compiler warnings (more than the trunk's current 4 warnings).

          -1 findbugs. The patch appears to introduce 11 new Findbugs (version 1.3.9) warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          -1 core tests. The patch failed these unit tests:
          org.apache.hadoop.hbase.master.TestSplitLogManager

          Test results: https://builds.apache.org/job/PreCommit-HBASE-Build/2597//testReport/
          Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/2597//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-hadoop2-compat.html
          Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/2597//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-server.html
          Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/2597//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-hadoop1-compat.html
          Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/2597//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-common.html
          Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/2597//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-hadoop-compat.html
          Console output: https://builds.apache.org/job/PreCommit-HBASE-Build/2597//console

          This message is automatically generated.

          Jie Huang added a comment -

          I have updated the patch file against the latest trunk. This patch aims only to solve the problem stated in this bug entry, so that the user can do something like:

          create 't2', 'f'
          put 't2', 'r1', 'f:e_word', 'hello'
          put 't2', 'r1', 'f:f_word', 'bonjour'
          put 't2', 'r2', 'f:e_word', 'goodbye'
          put 't2', 'r2', 'f:f_word', 'au revoir'
          # scan with a predicate of the french word > 'b', selecting only the english word
          # it only returns "r1 column=f:e_word, value=hello" instead of 
          # "r1 column=f:e_word, value=hello; r2 column=f:e_word, value=goodbye"
          scan 't2', { COLUMNS => ['f:e_word'], FILTER => "SingleColumnValueFilter('f', 'f_word', >, 'binary:b')"  }
          

          From my perspective, this case is different from the one stated in the unit-test file. Here is a simple change that makes that unit-test case pass:

           private static class ValueHasMoreThan2BytesFilter extends FilterBase {
              @Override
              public ReturnCode filterKeyValue(KeyValue kv) {
                byte[] val = kv.getValue();
                if (val.length > 2) {
                  return ReturnCode.INCLUDE;
                } else {
                  return ReturnCode.NEXT_ROW; // it should be "return ReturnCode.SKIP;"
                }
              }
            }

          Then we can get the expected results here.

          Alex Baranau added a comment -

          return ReturnCode.NEXT_ROW; // it should be "return ReturnCode.SKIP;"
          

          Changing to SKIP would certainly produce the correct results. But I highly doubt it would work correctly internally; i.e., I believe the filter would still be applied to all columns, and it only appears to work because none of those filter applications affects the decision on whether to include the whole row. In essence (applying the filter to all columns instead of only the selected ones), it might still behave incorrectly internally.

          That's why I added this unit test, which verifies that the filter is applied to the selected columns only (and that decisions about filtering rows are based on filters applied to the selected columns only).

          Alex Baranau added a comment -

          @Jie Huang Could you please create a review on the review board?

          Jie Huang added a comment -

          @Alex, I totally understand your point. Yes, what you have shown in the unit test is a problem, not only for correctness but also for performance, as Lucas mentioned. We had better feed only the selected columns to the filter.

          However, making the filter work only on the selected columns does not solve the original problem here. The problem is something like select A from table where B > 'b'. If we apply the filtering criteria only to column A, as the unit-test case expects, we get an empty result. So the solution proposed here is:

          1. Add the filtered column to the selected columns, then apply the filter. The sanityCheck() function provides a chance to combine the "projected" and "where" columns if necessary.
          2. Remove the newly added column before returning results to the end user, since that column is not what the end user requested.

          Even with the filter restricted to the required columns, this kind of solution still lets us extend the "projected" set with the "where" columns, so the filter applies to "required" + "criteria" columns only instead of going through the whole column list.

          How about creating a new issue for "only feed the selected columns to the filter"? The initial thought is to do some comparison work in the matcher. Any comments?
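The two steps above can be sketched in plain Java. This is a self-contained illustration of the merge-then-strip flow only, with no HBase classes; all names here (effectiveColumns, stripCriteriaOnly, the string-keyed maps) are hypothetical, not the patch's API.

```java
import java.util.*;

// Illustrative sketch of the two-step proposal:
// 1) extend the projected columns with the filter's criteria columns before scanning,
// 2) strip the criteria-only columns from each row before returning it to the user.
public class ProjectionMergeSketch {
  static Set<String> effectiveColumns(Set<String> projected, Set<String> criteria) {
    Set<String> merged = new TreeSet<>(projected);
    merged.addAll(criteria); // step 1: scan "required" + "criteria" columns
    return merged;
  }

  static Map<String, String> stripCriteriaOnly(Map<String, String> row,
                                               Set<String> projected) {
    Map<String, String> out = new TreeMap<>();
    for (Map.Entry<String, String> e : row.entrySet()) {
      if (projected.contains(e.getKey())) { // step 2: keep only user-requested columns
        out.put(e.getKey(), e.getValue());
      }
    }
    return out;
  }

  public static void main(String[] args) {
    Set<String> projected = new TreeSet<>(Arrays.asList("f:e_word"));
    Set<String> criteria = new TreeSet<>(Arrays.asList("f:f_word"));
    System.out.println(effectiveColumns(projected, criteria)); // [f:e_word, f:f_word]

    Map<String, String> row = new TreeMap<>();
    row.put("f:e_word", "hello");
    row.put("f:f_word", "bonjour");
    System.out.println(stripCriteriaOnly(row, projected)); // {f:e_word=hello}
  }
}
```

In the actual patch this merge would happen server-side (via sanityCheck()), so the user still only writes select-A-where-B semantics.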

          Lucas Bernardi added a comment -

          How about specifying filter columns explicitly on the client side?
          So if I want to do something equivalent to select a, b from x where c>0 and d<1, in HBase it would be:
          get.addColumn(a);
          get.addColumn(b);
          get.addFilterColumn(c);
          get.addFilterColumn(d);

          So the filter would only be applied to key values for columns c and d.
          If no filter column is specified, the filter would be applied to the projected columns only.

          Alex Baranau added a comment -

          @Jie Yeah, it seems you are right. The problem I'm referring to is a bit different.

          Now that I re-read the issue description, I think the two problems could be contradictory:
          1. This issue states that filters should be applied to columns not in the selected list.
          2. The problem I'm referring to states that filters should be applied only to the columns in the selected list.

          Maybe what I'm pointing out with the unit test is really not a bug, but designed behavior? Then again, maybe not, because it would make the scanner always fetch (physically from storage) even columns we are not interested in whenever filters are added to the scan. That could add a lot of unnecessary work (currently, when the data is in different column families; or in the future for the same CF, if the storage format is improved to allow skipping columns that are not requested within a single CF). I'd say it may be better to require users to explicitly specify the columns the filters should be applied to iff those columns are not in the selected list. In the case of SingleColumnValueFilter, we may say the user specifies the column explicitly. I.e., we should also have the ability (or do we already?) for a filter to add a column to an "apply the filter to, but do not transfer to the user" column list.

          Ted Yu added a comment -

          @Alex:
          Take a look at SingleColumnValueExcludeFilter

          Lucas Bernardi added a comment -

          Yeah, I agree with Alex; the problem I see is that right now the code is somewhere in the middle: the filter is applied to the first column and to the selected columns. I'm trying to figure out how to fix that, but I'm having trouble finding why the first column is passed to the filter.

          Alex Baranau added a comment -

          Re SingleColumnValueExcludeFilter: then why is this issue a bug again? I believe this is the correct behavior (based on the thoughts in my last comment). Then one needs to include the column in the selected-columns list and use SingleColumnValueExcludeFilter.

          At the same time (a different story/effort), we may want to make this "apply filter but do not transfer back to user" feature available in some more generic way. Maybe some filter wrapper or so.

          Jie Huang added a comment -

          Actually, I know we can use SingleColumnValueExcludeFilter to solve the problem to some extent; that is why I said earlier that this is not a bug. The documentation already hints that both the "required" and "criteria" columns should be included together; otherwise, the criteria column will be regarded as missing by the filter. The correct way is:

           scan 't2', { COLUMNS => ['f:e_word', 'f:f_word'], FILTER => "SingleColumnValueExcludeFilter('f', 'f_word', >, 'binary:b')"  }
          

          But SingleColumnValueExcludeFilter only helps by not returning the filter's "criteria" column. We still need to specify both the "e_word" and "f_word" columns in the "required" list; otherwise we still get the wrong result:

          scan 't2', { COLUMNS => ['f:e_word'], FILTER => "SingleColumnValueExcludeFilter('f', 'f_word', >, 'binary:b')"  }
          ROW                            COLUMN+CELL
           r1                            column=f:e_word, timestamp=1345084564838, value=hello
           r2                            column=f:e_word, timestamp=1345084564895, value=goodbye
          

          The points of the proposal in the patch file are:

          1. select A from table where B>'b' is more natural to a SQL user than select A exclusively (B) from table where B>'b'.
          2. It adds a chance to combine the "required" and "criteria" columns before the filtering work is done, which may be used by other filters later; it is not only for SingleColumnValueFilter.

          For the other problem in the unit test, shall we create a new issue and have further discussion there?
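The wrong result above can be understood with a small plain-Java simulation of single-column value filtering: if the scan never surfaces f:f_word, the filter never sees the criteria column, and under the default filterIfMissing=false semantics the row passes unconditionally. This is a self-contained illustration, not the HBase implementation; rowPasses and its parameters are hypothetical names.

```java
import java.util.Map;

// Simulates a SingleColumnValueFilter-style row decision over only the cells the
// scan actually surfaced. If the criteria column was not scanned, the filter
// cannot evaluate it and falls back to the filterIfMissing behavior.
public class MissingCriteriaColumnSketch {
  static boolean rowPasses(Map<String, String> scannedCells,
                           String criteriaColumn, String threshold,
                           boolean filterIfMissing) {
    String value = scannedCells.get(criteriaColumn);
    if (value == null) {
      return !filterIfMissing;   // column missing: pass unless filterIfMissing is set
    }
    return value.compareTo(threshold) > 0; // the "> 'binary:b'" style comparison
  }

  public static void main(String[] args) {
    // r2 with only the projected column scanned: the criteria column is invisible,
    // so the row wrongly passes.
    Map<String, String> r2Projected = Map.of("f:e_word", "goodbye");
    System.out.println(rowPasses(r2Projected, "f:f_word", "b", false)); // true (wrong)

    // r2 with both columns scanned: 'au revoir' > 'b' is false, row is filtered out.
    Map<String, String> r2Both = Map.of("f:e_word", "goodbye", "f:f_word", "au revoir");
    System.out.println(rowPasses(r2Both, "f:f_word", "b", false)); // false (correct)
  }
}
```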

          Jie Huang added a comment -

          Alex Baranau, Lucas Bernardi: I have checked the code related to the problem you mentioned. According to the comments, this topic has been discussed before. The conclusion was:

          /**
               * Filters should be checked before checking column trackers. If we do
               * otherwise, as was previously being done, ColumnTracker may increment its
               * counter for even that KV which may be discarded later on by Filter. This
               * would lead to incorrect results in certain cases.
               */
              if (filter != null) {
                ReturnCode filterResponse = filter.filterKeyValue(kv);
                if (filterResponse == ReturnCode.SKIP) {
                  return MatchCode.SKIP;
                } else if (filterResponse == ReturnCode.NEXT_COL) {
                  return columns.getNextRowOrNextColumn(bytes, offset, qualLength);
                } else if (filterResponse == ReturnCode.NEXT_ROW) {
                  stickyNextRow = true;
                  return MatchCode.SEEK_NEXT_ROW;
                } else if (filterResponse == ReturnCode.SEEK_NEXT_USING_HINT) {
                  return MatchCode.SEEK_NEXT_USING_HINT;
                }
              }
          
              MatchCode colChecker = columns.checkColumn(bytes, offset, qualLength,
                  timestamp, type, kv.getMemstoreTS() > maxReadPointToTrackVersions);
          

          If both of you are still interested in this problem, we could try to figure out a potential solution. Any comments?

          Jean-Daniel Cryans added a comment -

          Unmarking patch available, the conversation died.

          Vasu Mariyala added a comment -

          I have attached the patch HBASE-4364.patch, which adds support for filtering based on column qualifiers/column families different from those in the selection list (scan.getFamilyMap()). This feature reduces the amount of data sent to the client in several scenarios.

          On a high level, the patch does the following

          a) A filter communicates the column qualifiers/column families it needs through the getFamilyMap() method of Filter, which is added in this patch.

          b) When selecting the store scanners, the filters' requirements are considered as well.

          c) ScanQueryMatcher has been changed to apply the filter only to key values that are among the column qualifiers the filter requires, and to include a key value in the result only if it is part of the scan family map.

          d) The change was implemented in a backward-compatible manner. The behavior of existing filters is not affected.

          Please review the first version of the patch and provide your valuable inputs on the approach that was used.

          If the approach is ok, then I will continue to work on refining the code and changing the existing filters like SingleColumnValueFilter to use the new approach.

          + Lars Hofhansl to comment on this
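A rough plain-Java sketch of the contract described in (a)-(c) may help review the approach. All names here (ColumnAwareFilter, getRequiredColumns, scanRow) are illustrative stand-ins, not the patch's actual API, and plain string-keyed maps stand in for HBase cells.

```java
import java.util.*;

// The filter advertises the columns it needs, the scanner reads the union of the
// user's selection and the filter's needs, applies the filter only to the columns
// the filter declared, and emits only the user-selected cells.
public class FilterFamilyMapSketch {
  interface ColumnAwareFilter {
    Set<String> getRequiredColumns();            // analogous to the patch's getFamilyMap()
    boolean accept(String column, String value);
  }

  static Map<String, String> scanRow(Map<String, String> storedRow,
                                     Set<String> selected, ColumnAwareFilter filter) {
    Set<String> toRead = new TreeSet<>(selected);
    toRead.addAll(filter.getRequiredColumns());  // (b) scanners honor the filter's needs
    Map<String, String> result = new TreeMap<>();
    for (String col : toRead) {
      String val = storedRow.get(col);
      if (val == null) continue;
      // (c) apply the filter only to the columns it declared
      if (filter.getRequiredColumns().contains(col) && !filter.accept(col, val)) {
        return Collections.emptyMap();           // whole row filtered out
      }
      if (selected.contains(col)) {
        result.put(col, val);                    // include only scan-family-map cells
      }
    }
    return result;
  }

  public static void main(String[] args) {
    ColumnAwareFilter frenchGtB = new ColumnAwareFilter() {
      public Set<String> getRequiredColumns() { return Set.of("f:f_word"); }
      public boolean accept(String c, String v) { return v.compareTo("b") > 0; }
    };
    Map<String, String> r1 = Map.of("f:e_word", "hello", "f:f_word", "bonjour");
    Map<String, String> r2 = Map.of("f:e_word", "goodbye", "f:f_word", "au revoir");
    System.out.println(scanRow(r1, Set.of("f:e_word"), frenchGtB)); // {f:e_word=hello}
    System.out.println(scanRow(r2, Set.of("f:e_word"), frenchGtB)); // {}
  }
}
```

Note the early return models the whole-row rejection of SingleColumnValueFilter-style predicates; a real implementation would go through ScanQueryMatcher return codes instead.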

          Ted Yu added a comment -
          +        // Select clause has its own column qualifiers
          

          Better to replace 'Select clause' with another term.

          The approach looks feasible.
          Have you benchmarked the performance of the patch?

          Please put next patch on review board.

          Vasu Mariyala added a comment -

          Thanks Ted Yu for your comments. Before doing any performance benchmarking, I wanted to get some comments on the approach. I will work on it and post my observations.

          Sure, I would post the next patch on the review board.

          stack added a comment -

          Log of failed test.


            People

            • Assignee: Unassigned
            • Reporter: Todd Lipcon
            • Votes: 1
            • Watchers: 16