Uploaded image for project: 'Hive'
  1. Hive
  2. HIVE-27734

Add Iceberg's storage-partitioned join capabilities to Hive's [sorted-]bucket-map-join

    XMLWordPrintableJSON

Details

    • Improvement
    • Status: Resolved
    • Major
    • Resolution: Duplicate
    • 4.0.0-alpha-2
    • 4.1.0
    • Iceberg integration
    • None

    Description

      Iceberg's 'data bucketing' is implemented through its rich (function based) partitioning feature which helps to optimize join operations - called storage partitioned joins. 

      doc: https://docs.google.com/document/d/1foTkDSM91VxKgkEcBMsuAvEjNybjja-uHk-r3vtXWFE/edit#heading=h.82w8qxfl2uwl
      spark impl.: https://issues.apache.org/jira/browse/SPARK-37375

      This feature is not yet leveraged in Hive into its bucket-map-join optimization, neither alone nor with Iceberg's SortOrder to sorted-bucket-map-join.
      Customers migrating from Hive table format to Iceberg format with storage optimized schema will experience performance degradation on large tables where Iceberg's gain on no-listing performance improvement is significantly smaller than the actual join performance over bucket-join or even sorted-bucket-join.
       

      SET hive.query.results.cache.enabled=false;
      SET hive.fetch.task.conversion = none;
      SET hive.optimize.bucketmapjoin=true;
      SET hive.convert.join.bucket.mapjoin.tez=true;
      SET hive.auto.convert.join.noconditionaltask.size=1000;
      --if you are working with external table, you need this for bmj:
      SET hive.disable.unsafe.external.table.operations=false;
      
      
      -- HIVE BUCKET-MAP-JOIN
      DROP TABLE IF EXISTS default.hivebmjt1 PURGE;
      DROP TABLE IF EXISTS default.hivebmjt2 PURGE;
      CREATE TABLE default.hivebmjt1 (id int, txt string) CLUSTERED BY (id) INTO 8 BUCKETS;
      CREATE TABLE default.hivebmjt2 (id int, txt string);
      INSERT INTO default.hivebmjt1 VALUES (1,'1'),(2,'2'),(3,'3'),(4,'4'),(5,'5'),(6,'6'),(7,'7'),(8,'8');
      INSERT INTO default.hivebmjt2 VALUES (1,'1'),(2,'2'),(3,'3'),(4,'4');
      
      EXPLAIN
      SELECT * FROM default.hivebmjt1 f INNER      JOIN default.hivebmjt2 d ON f.id = d.id;
      EXPLAIN
      SELECT * FROM default.hivebmjt1 f LEFT OUTER JOIN default.hivebmjt2 d ON f.id = d.id;
      -- Both are optimized into BMJ
      
      
      -- ICEBERG BUCKET-MAP-JOIN via Iceberg's storage-partitioned join
      DROP TABLE IF EXISTS default.icespbmjt1 PURGE;
      DROP TABLE IF EXISTS default.icespbmjt2 PURGE;
      CREATE TABLE default.icespbmjt1 (txt string) PARTITIONED BY (id int) STORED BY ICEBERG ;
      CREATE TABLE default.icespbmjt2 (txt string) PARTITIONED BY (id int) STORED BY ICEBERG ;
      INSERT INTO default.icespbmjt1 VALUES ('1',1),('2',2),('3',3),('4',4);
      INSERT INTO default.icespbmjt2 VALUES ('1',1),('2',2),('3',3),('4',4);
      
      EXPLAIN
      SELECT * FROM default.icespbmjt1 f INNER      JOIN default.icespbmjt2 d ON f.id = d.id;
      EXPLAIN
      SELECT * FROM default.icespbmjt1 f LEFT OUTER JOIN default.icespbmjt2 d ON f.id = d.id;
      -- Only Map-Join optimised
      

      Attachments

        Issue Links

          Activity

            People

              okumin Shohei Okumiya
              jkovacs@HW Janos Kovacs
              Votes:
              0 Vote for this issue
              Watchers:
              5 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved: