HBase / HBASE-5104

Provide a reliable intra-row pagination mechanism

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: None
    • Labels:
      None

      Description

      Addendum:

      Doing pagination (retrieving at most "limit" number of KVs at a particular "offset") is currently supported via the ColumnPaginationFilter. However, it is not a very clean way of supporting pagination. Some of the problems with it are:

      • Normally, one would expect a query with (Filter(A) AND Filter(B)) to return the same results as (query with Filter(A)) INTERSECT (query with Filter(B)). This is not the case for ColumnPaginationFilter, because its internal state is updated depending on whether Filter(A) returns TRUE or FALSE for a particular cell.
      • When this Filter is used in combination with other filters (e.g., doing AND with another filter using FilterList), the behavior of the query depends on the order of filters in the FilterList. This is not ideal.
      • ColumnPaginationFilter is a stateful filter which ends up counting multiple versions of the cell as separate values even if another filter upstream or the ScanQueryMatcher is going to reject the value for other reasons.
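
The order dependence described in these points can be reproduced with a toy model that does not use HBase at all. In the sketch below, `CountingPager` and `scan` are hypothetical stand-ins (not HBase classes): the pager counts every cell it is shown, and the combiner ANDs its filters in list order and stops at the first rejection. The short-circuit behaviour is an assumption made for illustration, not a statement about how FilterList is implemented.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

public class PaginationOrderDemo {
    /** Hypothetical stateful pager: counts every cell it is asked about. */
    static final class CountingPager implements Predicate<String> {
        private final int limit;
        private final int offset;
        private int seen = 0;

        CountingPager(int limit, int offset) {
            this.limit = limit;
            this.offset = offset;
        }

        @Override
        public boolean test(String column) {
            int index = seen++; // state advances even if another filter rejects the cell later
            return index >= offset && index < offset + limit;
        }
    }

    /** ANDs the filters in list order, stopping at the first rejection. */
    static List<String> scan(List<String> columns, List<Predicate<String>> filters) {
        List<String> out = new ArrayList<>();
        for (String column : columns) {
            boolean keep = true;
            for (Predicate<String> filter : filters) {
                if (!filter.test(column)) {
                    keep = false;
                    break;
                }
            }
            if (keep) {
                out.add(column);
            }
        }
        return out;
    }

    static List<String> run(boolean prefixFirst) {
        List<String> columns =
            List.of("tag0:001", "tag1:001", "tag1:002", "tag1:003", "tag2:001");
        Predicate<String> prefix = c -> c.startsWith("tag1");
        Predicate<String> pager = new CountingPager(2 /* limit */, 1 /* offset */);
        return scan(columns, prefixFirst ? List.of(prefix, pager) : List.of(pager, prefix));
    }

    public static void main(String[] args) {
        System.out.println(run(true));  // [tag1:002, tag1:003]
        System.out.println(run(false)); // [tag1:001, tag1:002]
    }
}
```

With the prefix check first, the pager only counts matching columns. With the pager first, it also counts "tag0:001", so the page shifts by one.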

      Seems like we need a reliable way to do pagination. The particular use case that prompted this JIRA is pagination within the same rowKey. For example, for a given row key R, get columns with prefix P, starting at offset X (among columns which have prefix P) and limit Y. Some possible fixes might be:

      1) enhance ColumnPrefixFilter to support another constructor which supports limit/offset.
      2) Support pagination (limit/offset) at the Scan/Get API level (rather than as a filter) [Like SQL].
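
Both options amount to applying offset and limit only after the column selection has been made. A minimal sketch of that semantics, using a plain sorted list in place of a row (the class and method names here are hypothetical, not a proposed HBase API):

```java
import java.util.List;
import java.util.stream.Collectors;

public class IntraRowPageSketch {
    /**
     * Returns at most `limit` columns starting at `offset`, where the offset
     * is counted among the columns that carry the prefix, not among all columns.
     */
    static List<String> page(List<String> sortedColumns, String prefix, int offset, int limit) {
        return sortedColumns.stream()
            .filter(c -> c.startsWith(prefix)) // select first
            .skip(offset)                      // then paginate
            .limit(limit)
            .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> row =
            List.of("tag0:001", "tag1:001", "tag1:002", "tag1:003", "tag2:001");
        System.out.println(page(row, "tag1", 1, 2)); // [tag1:002, tag1:003]
    }
}
```

Because the pagination step holds no state that other filters can disturb, the result does not depend on how the selection is expressed.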

      Original Post:

      Thanks Jiakai Liu for reporting this issue and doing the initial investigation. Email from Jiakai below:

      Assuming that we have an index column family with the following entries:
      "tag0:001:thread1"
      ...
      "tag1:001:thread1"
      "tag1:002:thread2"
      ...
      "tag1:010:thread10"
      ...
      "tag2:001:thread1"
      "tag2:005:thread5"
      ...

      To get threads with "tag1" in range [5, 10), I tried the following code:

      ColumnPrefixFilter filter1 = new ColumnPrefixFilter(Bytes.toBytes("tag1"));
      ColumnPaginationFilter filter2 = new ColumnPaginationFilter(5 /* limit */, 5 /* offset */);

      FilterList filters = new FilterList(Operator.MUST_PASS_ALL);
      filters.addFilter(filter1);
      filters.addFilter(filter2);

      Get get = new Get(USER);
      get.addFamily(COLUMN_FAMILY);
      get.setMaxVersions(1);
      get.setFilter(filters);

      Somehow it didn't work as expected. It returned the entries as if the filter1 were not set.

      Turns out the ColumnPrefixFilter returns SEEK_NEXT_USING_HINT in some cases. The FilterList filter does not handle this return code properly (it treats it as INCLUDE).
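
The effect of treating a seek hint as INCLUDE can be illustrated with a small stand-alone model. Everything below is hypothetical: the `Code` enum only mimics the names of the return codes, and the rule that the toy prefix filter asks for a seek when a column sorts before the prefix is an assumption about when the hint is issued.

```java
import java.util.ArrayList;
import java.util.List;

public class HintCodeSketch {
    enum Code { INCLUDE, SKIP, SEEK_NEXT_USING_HINT }

    /** Toy prefix filter: requests a seek when the column sorts before the prefix. */
    static Code prefixFilter(String column, String prefix) {
        if (column.startsWith(prefix)) {
            return Code.INCLUDE;
        }
        return column.compareTo(prefix) < 0 ? Code.SEEK_NEXT_USING_HINT : Code.SKIP;
    }

    /** Collects the columns that pass; the flag switches the mishandling on or off. */
    static List<String> scan(List<String> columns, String prefix, boolean hintCountsAsInclude) {
        List<String> out = new ArrayList<>();
        for (String column : columns) {
            Code code = prefixFilter(column, prefix);
            boolean pass = code == Code.INCLUDE
                || (hintCountsAsInclude && code == Code.SEEK_NEXT_USING_HINT);
            if (pass) {
                out.add(column);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> columns = List.of("tag0:001", "tag1:001", "tag2:001");
        System.out.println(scan(columns, "tag1", true));  // [tag0:001, tag1:001]
        System.out.println(scan(columns, "tag1", false)); // [tag1:001]
    }
}
```

In the mishandled case the non-matching "tag0:001" leaks through, so any pagination filter further down the list would count it as well.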

          Activity

          Andrew Olson made changes -
          Link: This issue relates to HBASE-6954
          Mikhail Bautin made changes -
          Status: Patch Available [ 10002 ] to Resolved [ 5 ]
          Resolution: Fixed [ 1 ]
          Mikhail Bautin made changes -
          Status: Open [ 1 ] to Patch Available [ 10002 ]
          Mikhail Bautin made changes -
          Status: Patch Available [ 10002 ] to Open [ 1 ]
          Phabricator made changes -
          Attachment: D2799.6.patch [ 12526586 ]
          Phabricator made changes -
          Attachment: D2799.5.patch [ 12525758 ]
          Phabricator made changes -
          Attachment: D2799.4.patch [ 12524087 ]
          Phabricator made changes -
          Attachment: D2799.3.patch [ 12522735 ]
          Phabricator made changes -
          Attachment: D2799.2.patch [ 12522665 ]
          Phabricator made changes -
          Attachment: D2799.1.patch [ 12522655 ]
          Mikhail Bautin made changes -
          Status: Open [ 1 ] to Patch Available [ 10002 ]
          Lars Hofhansl made changes -
          Link: This issue relates to HBASE-5229
          Kannan Muthukkaruppan made changes -
          Summary: "Provide a reliable pagination mechanism" to "Provide a reliable intra-row pagination mechanism"
          Description: typo corrected in the Addendum ("nota" to "not a"); text otherwise unchanged
          Kannan Muthukkaruppan made changes -
          Summary: "FilterList doesn't work right with ColumnPaginationFilter" to "Provide a reliable pagination mechanism"
          Description: Addendum added ahead of the original post
          Kannan Muthukkaruppan made changes -
          Summary: "FilterList doesn't work right with filters (such as ColumnPrefixFilter) which use the SEEK_NEXT_USING_HINT" to "FilterList doesn't work right with ColumnPaginationFilter"
          Kannan Muthukkaruppan made changes -
          Field Original Value New Value
          Attachment testFilterList.rb [ 12508882 ]
          Kannan Muthukkaruppan created issue -

            People

            • Assignee: Madhuwanti Vaidya
            • Reporter: Kannan Muthukkaruppan
            • Votes: 0
            • Watchers: 13