[HIVE-27734] Add Iceberg's storage-partitioned join capabilities to Hive's [sorted-]bucket-map-join - ASF JIRA

XML

Word

Printable

JSON

Details

Type: New Feature
Status: Resolved
Priority: Major
Resolution: Duplicate
Affects Version/s: 4.0.0-alpha-2
Fix Version/s: 4.1.0
Component/s: Iceberg integration
Labels:
None

Description

Iceberg's 'data bucketing' is implemented through its rich (function based) partitioning feature which helps to optimize join operations - called storage partitioned joins.

doc: https://docs.google.com/document/d/1foTkDSM91VxKgkEcBMsuAvEjNybjja-uHk-r3vtXWFE/edit#heading=h.82w8qxfl2uwl
spark impl.: https://issues.apache.org/jira/browse/SPARK-37375

This feature is not yet leveraged in Hive into its bucket-map-join optimization, neither alone nor with Iceberg's SortOrder to sorted-bucket-map-join.
Customers migrating from Hive table format to Iceberg format with storage optimized schema will experience performance degradation on large tables where Iceberg's gain on no-listing performance improvement is significantly smaller than the actual join performance over bucket-join or even sorted-bucket-join.

SET hive.query.results.cache.enabled=false;
SET hive.fetch.task.conversion = none;
SET hive.optimize.bucketmapjoin=true;
SET hive.convert.join.bucket.mapjoin.tez=true;
SET hive.auto.convert.join.noconditionaltask.size=1000;
--if you are working with external table, you need this for bmj:
SET hive.disable.unsafe.external.table.operations=false;


-- HIVE BUCKET-MAP-JOIN
DROP TABLE IF EXISTS default.hivebmjt1 PURGE;
DROP TABLE IF EXISTS default.hivebmjt2 PURGE;
CREATE TABLE default.hivebmjt1 (id int, txt string) CLUSTERED BY (id) INTO 8 BUCKETS;
CREATE TABLE default.hivebmjt2 (id int, txt string);
INSERT INTO default.hivebmjt1 VALUES (1,'1'),(2,'2'),(3,'3'),(4,'4'),(5,'5'),(6,'6'),(7,'7'),(8,'8');
INSERT INTO default.hivebmjt2 VALUES (1,'1'),(2,'2'),(3,'3'),(4,'4');

EXPLAIN
SELECT * FROM default.hivebmjt1 f INNER      JOIN default.hivebmjt2 d ON f.id = d.id;
EXPLAIN
SELECT * FROM default.hivebmjt1 f LEFT OUTER JOIN default.hivebmjt2 d ON f.id = d.id;
-- Both are optimized into BMJ


-- ICEBERG BUCKET-MAP-JOIN via Iceberg's storage-partitioned join
DROP TABLE IF EXISTS default.icespbmjt1 PURGE;
DROP TABLE IF EXISTS default.icespbmjt2 PURGE;
CREATE TABLE default.icespbmjt1 (txt string) PARTITIONED BY (id int) STORED BY ICEBERG ;
CREATE TABLE default.icespbmjt2 (txt string) PARTITIONED BY (id int) STORED BY ICEBERG ;
INSERT INTO default.icespbmjt1 VALUES ('1',1),('2',2),('3',3),('4',4);
INSERT INTO default.icespbmjt2 VALUES ('1',1),('2',2),('3',3),('4',4);

EXPLAIN
SELECT * FROM default.icespbmjt1 f INNER      JOIN default.icespbmjt2 d ON f.id = d.id;
EXPLAIN
SELECT * FROM default.icespbmjt1 f LEFT OUTER JOIN default.icespbmjt2 d ON f.id = d.id;
-- Only Map-Join optimised

Attachments

Issue Links

duplicates

HIVE-28410 Partition-Aware Optimization for Iceberg or OTF

Open

split to

HIVE-28412 Sort Merge Bucket Join for Iceberg tables

Open

HIVE-28411 Bucket Map Join on Iceberg tables

Resolved

links to

Design document

Activity

People

Assignee:: Shohei Okumiya

Reporter:: Janos Kovacs

Votes:: 0 Vote for this issue

Watchers:: 5 Start watching this issue

Dates

Created:: 26/Sep/23 09:27

Updated:: 01/Nov/24 10:41

Resolved:: 01/Nov/24 10:41