Details
-
Improvement
-
Status: Open
-
Major
-
Resolution: Unresolved
-
None
-
None
-
None
Description
I believe that a Hive query with an HBase Storage Handler is incorrectly applies a predicate pushdown into the storage handler.
I observed a FETCH optimization that took a long time to complete because it was performing a table scan across the entire HBase table.
The only case in which a predicate should be pushed down the storage layer is for
SELECT * FROM TABLE my_hbase_table WHERE row_key=?
This would be appropriate (EQ on the row key). Anything else will involve a scan of the table and there is no way to easily calculate how small a scan it will require and therefore should always be passed to the compute engine (Tez).
beeline> CREATE EXTERNAL TABLE t_hbase(key STRING, tinyint_col TINYINT, smallint_col SMALLINT, int_col INT, bigint_col BIGINT, float_col FLOAT, double_col DOUBLE, boolean_col BOOLEAN) STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler' WITH SERDEPROPERTIES ("hbase.columns.mapping" = "cf:binarykey#-,cf:binarybyte#-,cf:binaryshort#-,:key#-,cf:binarylong#-,cf:binaryfloat#-,cf:binarydouble#-,cf:binaryboolean#-") TBLPROPERTIES ("hbase.table.name" = "t_hive", "hbase.table.default.storage.type" = "binary", "external.table.purge" = "true") beeline> insert into table t_hbase values ('user1', 1, 11, 10, 1, 1.0, 1.0, true); beeline> explain select * from t_hbase where int_col=10; Explain Plan optimized by CBO. Stage-0 Fetch Operator limit:-1 Select Operator [SEL_2] Output:["_col0","_col1","_col2","_col3","_col4","_col5","_col6","_col7"] TableScan [TS_0] Output: ["key","tinyint_col","smallint_col","bigint_col","float_col","double_col","boolean_col"]
You can see for this trivial example that this is an external table, Hive has no idea what's in it, but it is doing a Fetch + TableScan
Attachments
Issue Links
- is related to
-
HIVE-1643 support range scans and non-key columns in HBase filter pushdown
- Open
- relates to
-
HIVE-24845 Make Return of InputEstimator estimate Optional
- Open