HIVE-3603

Enable client-side caching for scans on HBase

    Details

    • Type: Improvement
    • Status: Closed
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 0.12.0
    • Component/s: HBase Handler
    • Labels: None

      Description

      HBaseHandler sets up a TableInputFormat MR job against HBase to read data in. The underlying implementation (in HBaseHandler.java) makes an RPC call per row key, which is very inefficient. We need to specify a client-side cache size on the scan.

      Note that HBase currently only supports num-rows based caching (no way to specify a memory limit). Created HBASE-6770 to address this.
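To make the cost of one RPC per row concrete, here is a rough back-of-the-envelope model in Java. It is only a sketch under stated assumptions: the class and method names are hypothetical, it counts only the fetch round trips of a single scanner, and it ignores scanner open/close calls, region boundaries, and any size-based limits. The real HBase client may issue a different number of calls.

```java
// Hypothetical helper, not part of Hive or HBase.
final class ScanRpcEstimate {

    // Estimates fetch round trips for a scan that returns `rows` rows
    // when each round trip can carry at most `caching` rows.
    // Assumption: every round trip is full except possibly the last one.
    static long fetchRpcs(long rows, int caching) {
        if (rows < 0 || caching < 1) {
            throw new IllegalArgumentException("rows must be >= 0 and caching >= 1");
        }
        // Integer ceiling of rows / caching.
        return (rows + caching - 1) / caching;
    }

    public static void main(String[] args) {
        long rows = 1_000_000L;
        for (int caching : new int[] {1, 100, 1000}) {
            System.out.println("caching=" + caching + " -> "
                    + fetchRpcs(rows, caching) + " fetch round trips");
        }
    }
}
```

Under these assumptions, raising the row-count cache from 1 to 1000 reduces the fetch round trips for a million-row scan from 1,000,000 to 1,000. Because the cache is counted in rows rather than bytes, wide rows can still make a large value expensive in client memory.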

          Activity

          Phabricator added a comment -

          navis requested code review of "HIVE-3603 [jira] Enable client-side caching for scans on HBase".
          Reviewers: JIRA

          DPAL-1955 Enable client-side caching for scans on HBase

          TEST PLAN
          EMPTY

          REVISION DETAIL
          https://reviews.facebook.net/D7761

          AFFECTED FILES
          hbase-handler/src/java/org/apache/hadoop/hive/hbase/HBaseSerDe.java
          hbase-handler/src/java/org/apache/hadoop/hive/hbase/HBaseStorageHandler.java
          hbase-handler/src/java/org/apache/hadoop/hive/hbase/HiveHBaseTableInputFormat.java
          hbase-handler/src/test/queries/positive/hbase_scan_params.q
          hbase-handler/src/test/results/positive/hbase_scan_params.q.out

          To: JIRA, navis

          Phabricator added a comment -

          zhenxiao has commented on the revision "HIVE-3603 [jira] Enable client-side caching for scans on HBase".

          Very minor comments.

          INLINE COMMENTS
          hbase-handler/src/test/queries/positive/hbase_scan_params.q:7 How about:
          select * from hbase_pushdown order by key limit 10;

          REVISION DETAIL
          https://reviews.facebook.net/D7761

          To: JIRA, navis
          Cc: zhenxiao

          Zhenxiao Luo added a comment -

          Non-Committer Review, minor comments at:
          https://reviews.facebook.net/D7761

          Phabricator added a comment -

          navis has commented on the revision "HIVE-3603 [jira] Enable client-side caching for scans on HBase".

          INLINE COMMENTS
          hbase-handler/src/test/queries/positive/hbase_scan_params.q:7 It creates an MR task, and I don't want to add more test time. Furthermore, HBase always returns rows in a deterministic order (sorted by row key).

          REVISION DETAIL
          https://reviews.facebook.net/D7761

          To: JIRA, navis
          Cc: zhenxiao

          Zhenxiao Luo added a comment -

          Looks good to me.
          Non-committer +1.

          Phabricator added a comment -

          ashutoshc has requested changes to the revision "HIVE-3603 [jira] Enable client-side caching for scans on HBase".

          A couple of comments on Phabricator.

          INLINE COMMENTS
          hbase-handler/src/java/org/apache/hadoop/hive/hbase/HBaseSerDe.java:68 It seems like it's sufficient to set these strings in jobConf for client-side caching to kick in. In that case, these strings must be interpreted by HBase as well, so they must also be defined as constants in the HBase code. Shall we just refer to those?
          hbase-handler/src/test/queries/positive/hbase_scan_params.q:7 How does this test verify that client-side caching kicked in? Did you do some manual verification to make sure caching is indeed taking place?

          REVISION DETAIL
          https://reviews.facebook.net/D7761

          BRANCH
          DPAL-1955

          To: JIRA, ashutoshc, navis
          Cc: zhenxiao

          Phabricator added a comment -

          navis has commented on the revision "HIVE-3603 [jira] Enable client-side caching for scans on HBase".

          INLINE COMMENTS
          hbase-handler/src/test/queries/positive/hbase_scan_params.q:7 I only checked that the scanner is configured with those values. Could you suggest a way for a test to verify this?
          hbase-handler/src/java/org/apache/hadoop/hive/hbase/HBaseSerDe.java:68 Those are scanner configurations and, IMHO, should be configured per TS, shouldn't they?
          HIVE-2906 might help with configuring this.

          REVISION DETAIL
          https://reviews.facebook.net/D7761

          BRANCH
          DPAL-1955

          To: JIRA, ashutoshc, navis
          Cc: zhenxiao

          Edward Capriolo added a comment -

          I am +1. Is anyone not +1? The patch seems to enable scanner caching by setting the appropriate properties. Verifying that it works is going to be tricky, since we have no simple way of counting the RPC calls the underlying HBase client will make.

          Edward Capriolo added a comment -

          Thanks Navis.

          Hudson added a comment -

          ABORTED: Integrated in Hive-trunk-hadoop1-ptest #86 (See https://builds.apache.org/job/Hive-trunk-hadoop1-ptest/86/)
          HIVE-3603 Enable client-side caching for scans on HBase (Navis Ryu via EGC)

          Submitted by: Navis Ryu
          Reviewed by: Edward Capriolo (ecapriolo: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1503544)

          • /hive/trunk/hbase-handler/src/java/org/apache/hadoop/hive/hbase/HBaseSerDe.java
          • /hive/trunk/hbase-handler/src/java/org/apache/hadoop/hive/hbase/HBaseStorageHandler.java
          • /hive/trunk/hbase-handler/src/java/org/apache/hadoop/hive/hbase/HiveHBaseTableInputFormat.java
          • /hive/trunk/hbase-handler/src/test/queries/positive/hbase_scan_params.q
          • /hive/trunk/hbase-handler/src/test/results/positive/hbase_scan_params.q.out
          Hudson added a comment -

          FAILURE: Integrated in Hive-trunk-hadoop2-ptest #16 (See https://builds.apache.org/job/Hive-trunk-hadoop2-ptest/16/)
          HIVE-3603 Enable client-side caching for scans on HBase (Navis Ryu via EGC)

          Submitted by: Navis Ryu
          Reviewed by: Edward Capriolo (ecapriolo: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1503544)

          • /hive/trunk/hbase-handler/src/java/org/apache/hadoop/hive/hbase/HBaseSerDe.java
          • /hive/trunk/hbase-handler/src/java/org/apache/hadoop/hive/hbase/HBaseStorageHandler.java
          • /hive/trunk/hbase-handler/src/java/org/apache/hadoop/hive/hbase/HiveHBaseTableInputFormat.java
          • /hive/trunk/hbase-handler/src/test/queries/positive/hbase_scan_params.q
          • /hive/trunk/hbase-handler/src/test/results/positive/hbase_scan_params.q.out
          Hudson added a comment -

          ABORTED: Integrated in Hive-trunk-h0.21 #2200 (See https://builds.apache.org/job/Hive-trunk-h0.21/2200/)
          HIVE-3603 Enable client-side caching for scans on HBase (Navis Ryu via EGC)

          Submitted by: Navis Ryu
          Reviewed by: Edward Capriolo (ecapriolo: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1503544)

          • /hive/trunk/hbase-handler/src/java/org/apache/hadoop/hive/hbase/HBaseSerDe.java
          • /hive/trunk/hbase-handler/src/java/org/apache/hadoop/hive/hbase/HBaseStorageHandler.java
          • /hive/trunk/hbase-handler/src/java/org/apache/hadoop/hive/hbase/HiveHBaseTableInputFormat.java
          • /hive/trunk/hbase-handler/src/test/queries/positive/hbase_scan_params.q
          • /hive/trunk/hbase-handler/src/test/results/positive/hbase_scan_params.q.out
          Hudson added a comment -

          ABORTED: Integrated in Hive-trunk-hadoop2 #290 (See https://builds.apache.org/job/Hive-trunk-hadoop2/290/)
          HIVE-3603 Enable client-side caching for scans on HBase (Navis Ryu via EGC)

          Submitted by: Navis Ryu
          Reviewed by: Edward Capriolo (ecapriolo: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1503544)

          • /hive/trunk/hbase-handler/src/java/org/apache/hadoop/hive/hbase/HBaseSerDe.java
          • /hive/trunk/hbase-handler/src/java/org/apache/hadoop/hive/hbase/HBaseStorageHandler.java
          • /hive/trunk/hbase-handler/src/java/org/apache/hadoop/hive/hbase/HiveHBaseTableInputFormat.java
          • /hive/trunk/hbase-handler/src/test/queries/positive/hbase_scan_params.q
          • /hive/trunk/hbase-handler/src/test/results/positive/hbase_scan_params.q.out
          Swarnim Kulkarni added a comment -

          Navis, is there any specific reason we chose not to provide a default value for caching? As it stands, caching only kicks in if the user explicitly asks for it by specifying the property "hbase.scan.cache" as part of the DDL. However, I don't see any case where caching with a default value wouldn't be useful. Thoughts?
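The opt-in behaviour described in this comment can be pictured with a small Java sketch. It is an illustration, not the code from the patch: the class and method names are hypothetical, only the property name "hbase.scan.cache" comes from the comment above, and treating a missing property as "leave the scan at whatever the client would otherwise use" is an assumption.

```java
import java.util.OptionalInt;
import java.util.Properties;

// Hypothetical helper, not part of Hive or HBase.
final class ScanCacheProperty {

    // Property name as quoted in the comment above.
    static final String SCAN_CACHE_KEY = "hbase.scan.cache";

    // Returns the row-count cache size only if the table properties set it.
    // An empty result means "no explicit request"; this sketch applies no default.
    // A non-numeric value makes Integer.parseInt throw NumberFormatException.
    static OptionalInt configuredCaching(Properties tableProps) {
        String raw = tableProps.getProperty(SCAN_CACHE_KEY);
        if (raw == null) {
            return OptionalInt.empty();
        }
        return OptionalInt.of(Integer.parseInt(raw.trim()));
    }

    public static void main(String[] args) {
        Properties props = new Properties();
        System.out.println("unset: " + configuredCaching(props));

        props.setProperty(SCAN_CACHE_KEY, "500");
        System.out.println("set:   " + configuredCaching(props));
    }
}
```

A default, as the comment proposes, would amount to replacing the empty result with a fixed row count; which value would be sensible is not settled in this thread.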

          Edward Capriolo added a comment -

          You can add this to your hive-site.xml.


          Swarnim Kulkarni added a comment -

          Edward Capriolo, thanks! Also, how is setting this property different from directly setting the "hbase.client.scanner.caching" property in hive-site.xml without this enhancement? Wouldn't they have the same effect?

          Ashutosh Chauhan added a comment -

          This issue has been fixed and released as part of the 0.12 release. If you find further issues, please create a new JIRA and link it to this one.


            People

            • Assignee: Navis
            • Reporter: Karthik Ranganathan
            • Votes: 0
            • Watchers: 14
