  1. Hive
  2. HIVE-26674

REBALANCE type compaction


Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 4.0.0-beta-1
    • Component/s: None

    Description

      Problem statement: 

      Without explicit bucketing defined, bucket files are very sensitive to the amount of data loaded/modified in the table. 

      When

      • initial or larger time-window loads or reloads occur alongside smaller scheduled loads (for example, initial and monthly loads vs. daily loads),
      • or load scheduling is periodic but the volume of changed data is not,
      • or data volume and periodicity are both balanced, but runtime resources cause the loader application to run with a different number of tasks,

      the data loaded into non-explicitly bucketed full-ACID ORC tables can become unbalanced across buckets over time.

      The number of buckets is calculated from the amount of data to be loaded. If the table is created with a huge amount of initial data (which creates several buckets), and afterwards only a few records at a time are added, but frequently (each small load is written only into the first one or two buckets), the data becomes unbalanced across the buckets: the first few buckets contain much more data than the others.
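
      The skew described above can be illustrated with a toy model. This is only a sketch: the function name, the `rows_per_writer` threshold, and the rule that every load starts at bucket 0 are assumptions made for illustration, not Hive's actual writer logic.

```python
def simulate_loads(load_sizes, rows_per_writer=1000):
    """Toy model (hypothetical): each load uses ceil(rows / rows_per_writer)
    writers, and writer i always appends to bucket i.
    Returns the resulting row count per bucket."""
    buckets = []
    for rows in load_sizes:
        writers = max(1, -(-rows // rows_per_writer))  # ceiling division
        while len(buckets) < writers:
            buckets.append(0)
        base, extra = divmod(rows, writers)
        for i in range(writers):
            # Spread the load evenly over the writers used by this load.
            buckets[i] += base + (1 if i < extra else 0)
    return buckets


# One large initial load followed by 100 small daily loads.
counts = simulate_loads([8000] + [50] * 100)
print(counts)
```

      In this model the initial load fills eight buckets evenly, while every small load lands in the first bucket only, so the first bucket keeps growing and the rest stay fixed.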

      Concept:

      Rebalancing compaction

      A new compaction type ('REBALANCE') should be created to address data that is badly balanced among buckets. This compaction type would produce the same result as an INSERT OVERWRITE: a new base whose bucket indexes are independent of the previous base or deltas. The new number of buckets can optionally be supplied; otherwise the table keeps the same number of buckets, but with rebalanced data.
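
      A minimal sketch of the rebalancing idea, assuming buckets are modelled as plain Python lists of rows. The function name and the contiguous, near-equal split are illustrative assumptions; the issue does not specify how rows are assigned to the new buckets.

```python
def rebalance(buckets, num_buckets=None):
    """Redistribute all rows into near-equal buckets.

    buckets      -- list of lists of rows (the current, possibly skewed layout)
    num_buckets  -- optional new bucket count; defaults to the current count
    """
    target = len(buckets) if num_buckets is None else num_buckets
    if target < 1:
        raise ValueError("bucket count must be at least 1")
    rows = [row for bucket in buckets for row in bucket]
    base, extra = divmod(len(rows), target)
    result, start = [], 0
    for i in range(target):
        # The first `extra` buckets take one additional row each.
        size = base + (1 if i < extra else 0)
        result.append(rows[start:start + size])
        start += size
    return result


skewed = [[0, 1, 2, 3, 4, 5], [6], [7]]
same_count = rebalance(skewed)          # keeps 3 buckets
four = rebalance(skewed, num_buckets=4)  # explicit new bucket count
```

      Like an INSERT OVERWRITE, the sketch builds the new layout from scratch and does not reuse the previous bucket assignment.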

      Sorting

      Optionally, a sorting expression can be supplied, to be able to re-sort the data during the rebalance.

      The expression can be supplied via the ALTER TABLE COMPACT command:

      • ALTER TABLE <table> COMPACT 'REBALANCE' ORDER BY <column> ASC|DESC

      Manual rebalance

      The rebalance request can be created with the ALTER TABLE COMPACT command (i.e. as a manual compaction).

      Limitations

      • Rebalancing can be done only within partitions.
      • Rebalancing is not possible on explicitly bucketed (clustered) tables.
      • Rebalancing is not possible via MR-based compaction.
      • Rebalancing is not supported on insert-only tables.
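
      The table-level limitations above could be checked before a request is accepted. The sketch below is an assumption about how such a check might look; `TableInfo`, its fields, and the function name are hypothetical and do not correspond to Hive classes.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class TableInfo:
    """Hypothetical summary of the table properties relevant to rebalancing."""
    explicitly_bucketed: bool
    insert_only: bool


def rebalance_rejection_reason(table: TableInfo,
                               query_based_compactor: bool) -> Optional[str]:
    """Return why a rebalance request must be rejected, or None if it is allowed."""
    if table.explicitly_bucketed:
        return "rebalancing is not possible on explicitly bucketed tables"
    if table.insert_only:
        return "rebalancing is not supported on insert-only tables"
    if not query_based_compactor:
        return "rebalancing is not possible via MR-based compaction"
    return None


ok = rebalance_rejection_reason(TableInfo(False, False), query_based_compactor=True)
clustered = rebalance_rejection_reason(TableInfo(True, False), query_based_compactor=True)
```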

      Implications

      Compaction request (DB schema) changes

      • A new compaction type (REBALANCE) must be added to the allowed compaction TYPES.
      • A new optional field (and nullable DB column) is required to store the number of requested implicit buckets.
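
      As a rough sketch of the request-model change: MINOR and MAJOR are Hive's existing compaction types, while the class names, field names, and validation rules below are assumptions for illustration, not the actual metastore schema.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class CompactionType(Enum):
    MINOR = "minor"
    MAJOR = "major"
    REBALANCE = "rebalance"  # the new type proposed in this issue


@dataclass
class CompactionRequest:
    """Hypothetical request record; number_of_buckets maps to a nullable column."""
    table: str
    type: CompactionType
    number_of_buckets: Optional[int] = None  # None means: keep the current bucket count

    def __post_init__(self):
        if self.number_of_buckets is None:
            return
        # Assumption: a bucket count is only meaningful for REBALANCE requests.
        if self.type is not CompactionType.REBALANCE:
            raise ValueError("number_of_buckets is only valid for REBALANCE")
        if self.number_of_buckets < 1:
            raise ValueError("number_of_buckets must be at least 1")


request = CompactionRequest("sales", CompactionType.REBALANCE, number_of_buckets=4)
default_request = CompactionRequest("sales", CompactionType.REBALANCE)
```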

      ALTER TABLE COMPACT changes

      The ALTER TABLE COMPACT command must accept:

      • the 'REBALANCE' compaction type,
      • optionally, the new number of required buckets (... INTO {N} BUCKETS),
      • optionally, the sorting expression (ORDER BY columnA ASC, columnB DESC).
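
      To show how the optional clauses combine, here is a small helper that assembles the statement text. The helper itself is hypothetical. The table name is placed before COMPACT, as in Hive's existing ALTER TABLE ... COMPACT statement, and the clause wording (INTO {N} BUCKETS, ORDER BY) follows this proposal; the grammar in a released Hive version may differ, so check the Hive documentation before relying on it.

```python
def build_rebalance_statement(table, num_buckets=None, order_by=None):
    """Assemble a REBALANCE compaction statement as a string (illustrative only).

    order_by -- optional list of (column, direction) pairs, direction is ASC or DESC
    """
    parts = [f"ALTER TABLE {table} COMPACT 'REBALANCE'"]
    if num_buckets is not None:
        if num_buckets < 1:
            raise ValueError("bucket count must be at least 1")
        parts.append(f"INTO {num_buckets} BUCKETS")
    if order_by:
        terms = []
        for column, direction in order_by:
            if direction.upper() not in ("ASC", "DESC"):
                raise ValueError(f"invalid sort direction: {direction}")
            terms.append(f"{column} {direction.upper()}")
        parts.append("ORDER BY " + ", ".join(terms))
    return " ".join(parts)


plain = build_rebalance_statement("sales")
full = build_rebalance_statement("sales", num_buckets=4,
                                 order_by=[("columnA", "ASC"), ("columnB", "DESC")])
print(full)
```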

      Compactor changes

      Both the MR and query based compaction tasks must be enhanced with the ability to do a rebalancing compaction.

      Query based compaction changes

      New compactor implementations are required:

      • Query based rebalance compactor for fully acid tables

      MR based compaction changes

      MR is deprecated; rebalancing compaction will be implemented for MR only if it is easy to do so.

      Open points

          People

            Assignee: veghlaci05 László Végh
            Reporter: veghlaci05 László Végh

              Time Tracking

                Original Estimate: Not Specified
                Remaining Estimate: 0h
                Time Spent: 31.5h