HBase
  1. HBase
  2. HBASE-2438

Addition of a Column Pagination Filter

    Details

    • Type: New Feature New Feature
    • Status: Resolved
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: 0.20.3
    • Fix Version/s: 0.90.0
    • Component/s: Filters
    • Labels:
      None
    • Hadoop Flags:
      Reviewed
    • Release Note:
      Hide
       * A filter, based on the ColumnCountGetFilter, takes two arguments: limit and offset.
       * This filter can be used for row-based indexing, where references to other tables are stored across many columns, in order to efficient lookups and paginated results for end users.
      Show
       * A filter, based on the ColumnCountGetFilter, takes two arguments: limit and offset.  * This filter can be used for row-based indexing, where references to other tables are stored across many columns, in order to efficient lookups and paginated results for end users.
    • Tags:
      pagination, "row-based indexing", filter

      Description

      Client applications may need to do pagination, depending on the number of columns returned, it may be more efficient to perform pagination algorithms at the database level (similar to SQL's LIMIT and OFFSET). This will be an additional filter taking two parameters:

      • page
      • pageSize

      For every row, that gets returned, only a subset of columns are returned based on page and pageSize

      If the page / pageSize column goes over the limits, then no results are returned from the filter.

      A practical example for using a filter like this may be for folks doing Row-based indexing with Hbase.

        Activity

        Paul Kist created issue -
        Hide
        Andrew Purtell added a comment -

        Paul, are you planning to provide a patch for this issue?

        Show
        Andrew Purtell added a comment - Paul, are you planning to provide a patch for this issue?
        Andrew Purtell made changes -
        Field Original Value New Value
        Fix Version/s 0.20.3 [ 12314357 ]
        Hide
        Paul Kist added a comment -

        Andrew, yes, I'm working on a patch for this issue. Just wanted to present the idea to the community before I submitted it. It's almost finished.
        I've expanded the TestFilter unit tests to test this new Filter as well.

        Show
        Paul Kist added a comment - Andrew, yes, I'm working on a patch for this issue. Just wanted to present the idea to the community before I submitted it. It's almost finished. I've expanded the TestFilter unit tests to test this new Filter as well.
        Hide
        Jonathan Gray added a comment -

        @Paul, this would be for # of columns in a row, not # of rows?

        Show
        Jonathan Gray added a comment - @Paul, this would be for # of columns in a row, not # of rows?
        Hide
        Paul Kist added a comment -

        Jonathan,

        That is correct. Column pagination and row pagination need to be handled separately. I had initially looked at augmenting the PageFilter, however the issue with row pagination at the Filter level is the fact that the Filters run within the context of a single region server, unaware of any other rows that may be returned from other Filters on another region. That being the case, Row pagination is a different ballgame altogether, so I decided to keep the functionality of this Filter specific to columns.

        Show
        Paul Kist added a comment - Jonathan, That is correct. Column pagination and row pagination need to be handled separately. I had initially looked at augmenting the PageFilter, however the issue with row pagination at the Filter level is the fact that the Filters run within the context of a single region server, unaware of any other rows that may be returned from other Filters on another region. That being the case, Row pagination is a different ballgame altogether, so I decided to keep the functionality of this Filter specific to columns.
        Hide
        George P. Stathis added a comment -

        Jonathan, this proposed feature works well with https://issues.apache.org/jira/browse/HBASE-2426. It does not require it but it makes a lot of sense when used in conjunction. Hope this helps shed some light on what we are up to.

        Show
        George P. Stathis added a comment - Jonathan, this proposed feature works well with https://issues.apache.org/jira/browse/HBASE-2426 . It does not require it but it makes a lot of sense when used in conjunction. Hope this helps shed some light on what we are up to.
        Hide
        Jonathan Gray added a comment -

        Just wanted to clarify. This all sounds good guys. You're right, we currently don't fully support stateful scanners as they cannot ride across regions. This has been explored but not yet implemented.

        Show
        Jonathan Gray added a comment - Just wanted to clarify. This all sounds good guys. You're right, we currently don't fully support stateful scanners as they cannot ride across regions. This has been explored but not yet implemented.
        Hide
        Paul Kist added a comment -

        Patch containing the new ColumnPaginationFilter, as well as updates to the unit test.

        Show
        Paul Kist added a comment - Patch containing the new ColumnPaginationFilter, as well as updates to the unit test.
        Paul Kist made changes -
        Attachment hbase-2438-0.20.3.patch [ 12441650 ]
        Hide
        Jonathan Gray added a comment -

        Patch looks pretty good.

        One problem I see is that when you deserialize you set page and pageSize, but the offset (which is used) is not calculated like it is in the non-empty constructor. So looks like this will break once serialized (if I read the code correctly). Is there a way we can also have a unit test which would show that?

        Also, in the main filter call:

        + public ReturnCode filterKeyValue(KeyValue v)
        + {
        + ReturnCode code = (count < offset || count >= offset + pageSize) ? ReturnCode.SKIP : ReturnCode.INCLUDE;
        + count++;
        + return code;
        + }
        

        This code is a bit hard to read. Also, once you are passed the offset + pageSize, shouldn't you be sending the ReturnCode.NEXT_ROW since you don't want to include anything more in the current row? Count isn't used once you get passed the offset either so maybe this could be rewritten more like:

        + public ReturnCode filterKeyValue(KeyValue v)
        + {
        + if(count >= offset + pageSize) {
        + return ReturnCode.NEXT_ROW;
        + }
        + ReturnCode code = count < offset ? ReturnCode.SKIP : ReturnCode.INCLUDE;
        + count++;
        + return code;
        + }
        

        Good stuff guys.

        Show
        Jonathan Gray added a comment - Patch looks pretty good. One problem I see is that when you deserialize you set page and pageSize, but the offset (which is used) is not calculated like it is in the non-empty constructor. So looks like this will break once serialized (if I read the code correctly). Is there a way we can also have a unit test which would show that? Also, in the main filter call: + public ReturnCode filterKeyValue(KeyValue v) + { + ReturnCode code = (count < offset || count >= offset + pageSize) ? ReturnCode.SKIP : ReturnCode.INCLUDE; + count++; + return code; + } This code is a bit hard to read. Also, once you are passed the offset + pageSize, shouldn't you be sending the ReturnCode.NEXT_ROW since you don't want to include anything more in the current row? Count isn't used once you get passed the offset either so maybe this could be rewritten more like: + public ReturnCode filterKeyValue(KeyValue v) + { + if(count >= offset + pageSize) { + return ReturnCode.NEXT_ROW; + } + ReturnCode code = count < offset ? ReturnCode.SKIP : ReturnCode.INCLUDE; + count++; + return code; + } Good stuff guys.
        Hide
        Jonathan Gray added a comment -

        Added Paul Kist and George Stathis as contributors. Assigning to Paul.

        Show
        Jonathan Gray added a comment - Added Paul Kist and George Stathis as contributors. Assigning to Paul.
        Jonathan Gray made changes -
        Assignee Paul Kist [ kida78 ]
        Hide
        Paul Kist added a comment -

        Jonathan,

        Thanks for the performance tip, I'll definitely include that.
        Regarding the readFields and write functions, so you're saying that in readFields, i'll have to also calculate offset (because that's what's being called by the serializer which calls the empty constructor)?

        Could you point me to an example of how I can test the serialization of a filter? Thank you

        Show
        Paul Kist added a comment - Jonathan, Thanks for the performance tip, I'll definitely include that. Regarding the readFields and write functions, so you're saying that in readFields, i'll have to also calculate offset (because that's what's being called by the serializer which calls the empty constructor)? Could you point me to an example of how I can test the serialization of a filter? Thank you
        Hide
        George P. Stathis added a comment -

        Got it: TestSingleColumnValueFilter.testSerialization(). We'll do something like that. The ReturnCode.NEXT_ROW tip was great.

        Show
        George P. Stathis added a comment - Got it: TestSingleColumnValueFilter.testSerialization(). We'll do something like that. The ReturnCode.NEXT_ROW tip was great.
        Hide
        Jonathan Gray added a comment -

        That's right. When the deserialization happens, it calls the empty constructor and then readFields. This will leave offset as 0, so you need to calculate offset in the readFields method.

        Show
        Jonathan Gray added a comment - That's right. When the deserialization happens, it calls the empty constructor and then readFields. This will leave offset as 0, so you need to calculate offset in the readFields method.
        Hide
        Jonathan Gray added a comment -

        I would actually argue for switching the API to limit, offset rather than page, perPage. The latter is a subset of the former as far as supported behavior and some people may have use cases for the use of offset directly rather than fixed page sizes.

        Show
        Jonathan Gray added a comment - I would actually argue for switching the API to limit, offset rather than page, perPage. The latter is a subset of the former as far as supported behavior and some people may have use cases for the use of offset directly rather than fixed page sizes.
        Hide
        Paul Kist added a comment -

        Jonathan, I made the suggested changes, as it would allow for a more generic use of the Filter. Also, I added a unit test "TestColumnPaginationFilter" which basically tests the serialization and performs a simple filter example (since most of the filter testing is in the TestFilters unit test class. Will post the patch

        Show
        Paul Kist added a comment - Jonathan, I made the suggested changes, as it would allow for a more generic use of the Filter. Also, I added a unit test "TestColumnPaginationFilter" which basically tests the serialization and performs a simple filter example (since most of the filter testing is in the TestFilters unit test class. Will post the patch
        Hide
        Paul Kist added a comment -

        Updated patch based on user comments discussion. Ready for code review

        Show
        Paul Kist added a comment - Updated patch based on user comments discussion. Ready for code review
        Paul Kist made changes -
        Attachment hbase-2438-0.20.3-v2.patch [ 12441747 ]
        Hide
        Jonathan Gray added a comment -

        Patch looks good, just some nitpicks.

        • Take out cleanup changes made to TestSingleColumnValueFilter (your patch otherwise does not touch this file so doesn't belong in this patch)
        • TestColumnPaginationFilter has a copyright header from your company. This should be removed and replaced with the apache license.
        • ColumnPaginationFilter should be given an apache copyright header as well.

        Otherwise, this looks good. Haven't run unit tests yet but will try to get to that tonight.

        Good stuff Paul.

        Show
        Jonathan Gray added a comment - Patch looks good, just some nitpicks. Take out cleanup changes made to TestSingleColumnValueFilter (your patch otherwise does not touch this file so doesn't belong in this patch) TestColumnPaginationFilter has a copyright header from your company. This should be removed and replaced with the apache license. ColumnPaginationFilter should be given an apache copyright header as well. Otherwise, this looks good. Haven't run unit tests yet but will try to get to that tonight. Good stuff Paul.
        Hide
        Paul Kist added a comment -

        V.3 Update

        • Correct License Info
        Show
        Paul Kist added a comment - V.3 Update Correct License Info
        Paul Kist made changes -
        Attachment hbase-2438-0.20.3-v3.patch [ 12441774 ]
        Hide
        Jonathan Gray added a comment -

        Looks good Paul!

        Show
        Jonathan Gray added a comment - Looks good Paul!
        Hide
        Paul Kist added a comment -

        Thanks Jonathan, what's the next step? Let me know when we're set, and I can click the "Submit Patch" link.

        Show
        Paul Kist added a comment - Thanks Jonathan, what's the next step? Let me know when we're set, and I can click the "Submit Patch" link.
        Hide
        Jonathan Gray added a comment -

        Just click Submit Patch. That just flags it so a committer knows it's ready for commit.

        Show
        Jonathan Gray added a comment - Just click Submit Patch. That just flags it so a committer knows it's ready for commit.
        Hide
        Paul Kist added a comment -

        The correct patch is the v3 attachment.

        Show
        Paul Kist added a comment - The correct patch is the v3 attachment.
        Paul Kist made changes -
        Status Open [ 1 ] Patch Available [ 10002 ]
        Release Note  * A filter, based on the ColumnCountGetFilter, takes two arguments: limit and offset.
         * This filter can be used for row-based indexing, where references to other tables are stored across many columns, in order to efficient lookups and paginated results for end users.
        Hide
        stack added a comment -

        Committed branch and trunk. Thanks for the patch Paul.

        Show
        stack added a comment - Committed branch and trunk. Thanks for the patch Paul.
        stack made changes -
        Status Patch Available [ 10002 ] Resolved [ 5 ]
        Hadoop Flags [Reviewed]
        Fix Version/s 0.20.5 [ 12314800 ]
        Fix Version/s 0.21.0 [ 12313607 ]
        Resolution Fixed [ 1 ]
        Hide
        stack added a comment -

        Marking these as fixed against 0.21.0 rather than against 0.20.5.

        Show
        stack added a comment - Marking these as fixed against 0.21.0 rather than against 0.20.5.
        stack made changes -
        Fix Version/s 0.20.5 [ 12314800 ]

          People

          • Assignee:
            Paul Kist
            Reporter:
            Paul Kist
          • Votes:
            1 Vote for this issue
            Watchers:
            3 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Time Tracking

              Estimated:
              Original Estimate - 8h
              8h
              Remaining:
              Remaining Estimate - 8h
              8h
              Logged:
              Time Spent - Not Specified
              Not Specified

                Development