Uploaded image for project: 'Hive'
  1. Hive
  2. HIVE-24833

Hive Uses Fetch Task For HBaseStorageHandler Scans

Log workAgile BoardRank to TopRank to BottomAdd voteVotersWatch issueWatchersCreate sub-taskConvert to sub-taskMoveLinkCloneLabelsUpdate Comment AuthorReplace String in CommentUpdate Comment VisibilityDelete Comments
    XMLWordPrintableJSON

    Details

    • Type: Improvement
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: HBase Handler
    • Labels:
      None

      Description

      I believe that a Hive query with an HBase Storage Handler is incorrectly applies a predicate pushdown into the storage handler.

      I observed a FETCH optimization that took a long time to complete because it was performing a table scan across the entire HBase table.

      The only case in which a predicate should be pushed down the storage layer is for

      SELECT * FROM TABLE my_hbase_table WHERE row_key=?
      

      This would be appropriate (EQ on the row key). Anything else will involve a scan of the table and there is no way to easily calculate how small a scan it will require and therefore should always be passed to the compute engine (Tez).

      beeline> CREATE EXTERNAL TABLE t_hbase(key STRING,
                           tinyint_col TINYINT,
                           smallint_col SMALLINT,
                           int_col INT,
                           bigint_col BIGINT,
                           float_col FLOAT,
                           double_col DOUBLE,
                           boolean_col BOOLEAN)
      STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
      WITH SERDEPROPERTIES ("hbase.columns.mapping" = "cf:binarykey#-,cf:binarybyte#-,cf:binaryshort#-,:key#-,cf:binarylong#-,cf:binaryfloat#-,cf:binarydouble#-,cf:binaryboolean#-")
      TBLPROPERTIES ("hbase.table.name" = "t_hive",
                     "hbase.table.default.storage.type" = "binary",
                     "external.table.purge" = "true")
                     
                     
                     
      beeline> insert into table t_hbase values ('user1', 1, 11, 10, 1, 1.0, 1.0, true);
      
      beeline> explain select * from t_hbase where int_col=10;
      
      Explain
        Plan optimized by CBO.
        Stage-0
        Fetch Operator
          limit:-1
          Select Operator [SEL_2]
            Output:["_col0","_col1","_col2","_col3","_col4","_col5","_col6","_col7"]
          TableScan [TS_0]
         Output: 
         ["key","tinyint_col","smallint_col","bigint_col","float_col","double_col","boolean_col"]
      

      You can see for this trivial example that this is an external table, Hive has no idea what's in it, but it is doing a Fetch + TableScan

        Attachments

        Issue Links

          Activity

          $i18n.getText('security.level.explanation', $currentSelection) Viewable by All Users
          Cancel

            People

            • Assignee:
              Unassigned Assign to me
              Reporter:
              belugabehr David Mollitor

              Dates

              • Created:
                Updated:

                Issue deployment