Details
Type: Improvement
Status: Closed
Priority: Major
Resolution: Fixed
Description
Problem statement:
Without explicit bucketing defined, the layout of the bucket files is very sensitive to the amount of data loaded into or modified in the table.
Data loaded into non-explicitly bucketed full-ACID ORC tables can become unbalanced across buckets over time when:
- initial or larger time-window loads or reloads run alongside smaller load schedules (for example, initial and monthly loads vs. daily loads),
- the load schedule is periodic but the volume of the data changes is not,
- data volume and periodicity are both balanced, but runtime resources cause the loader application to run with a different number of tasks.
The number of buckets is calculated from the amount of data to be loaded. If the table is created with a huge amount of initial data (which creates several buckets) and afterwards only a few records are added at a time, but frequently (so they are written only into the first one or two buckets), the data becomes unbalanced: the first few buckets contain much more data than the others.
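The skew described above can be reproduced with a small toy model. This is only a sketch under stated assumptions: the rule "one implicit bucket per `rows_per_bucket` rows" and the function names are hypothetical stand-ins for however the writer actually sizes its output, not Hive's real logic.

```python
import math


def buckets_for_load(row_count, rows_per_bucket):
    """Toy rule (assumption): one implicit bucket per `rows_per_bucket` rows, at least one."""
    return max(1, math.ceil(row_count / rows_per_bucket))


def apply_load(bucket_sizes, row_count, rows_per_bucket):
    """Spread one load evenly over the first k buckets, adding buckets if needed."""
    k = buckets_for_load(row_count, rows_per_bucket)
    sizes = list(bucket_sizes) + [0] * max(0, k - len(bucket_sizes))
    base, extra = divmod(row_count, k)
    for i in range(k):
        sizes[i] += base + (1 if i < extra else 0)
    return sizes


def simulate(loads, rows_per_bucket=100_000):
    """Apply a sequence of loads and return the resulting rows per bucket."""
    sizes = []
    for row_count in loads:
        sizes = apply_load(sizes, row_count, rows_per_bucket)
    return sizes


# One large initial load followed by 30 small daily loads.
sizes = simulate([1_000_000] + [5_000] * 30)
print(sizes)
```

In this model the initial load creates ten equally filled buckets, while every small load lands in the first bucket only, so the first bucket ends up 2.5 times larger than each of the others.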
Concept:
Rebalancing compaction
A new compaction type ('REBALANCE') should be created to address data that is badly balanced among buckets. This compaction type would produce the same table layout as an INSERT OVERWRITE: a new base with bucket indexes that are independent of those in the previous base or deltas. The new number of buckets can optionally be supplied; otherwise the table keeps the same number of buckets, but with rebalanced data.
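The intended effect can be sketched on in-memory lists. This is an illustrative model only, assuming buckets are plain lists of rows; `rebalance` is a hypothetical helper, not the compactor implementation.

```python
def rebalance(buckets, num_buckets=None):
    """Return new buckets with the rows spread as evenly as possible.

    buckets: list of lists of rows (the content of the old base and deltas per bucket).
    num_buckets: optional new bucket count; defaults to the current count.
    """
    target = len(buckets) if num_buckets is None else num_buckets
    if target < 1:
        raise ValueError("number of buckets must be at least 1")
    # Collect every row; the old bucket assignment is discarded.
    rows = [row for bucket in buckets for row in bucket]
    base, extra = divmod(len(rows), target)
    result, start = [], 0
    for i in range(target):
        size = base + (1 if i < extra else 0)  # sizes differ by at most one row
        result.append(rows[start:start + size])
        start += size
    return result


skewed = [list(range(0, 70)), list(range(70, 80)), list(range(80, 90)), list(range(90, 100))]
same_count = rebalance(skewed)
five = rebalance(skewed, num_buckets=5)
print([len(b) for b in same_count])
print([len(b) for b in five])
```

No row is lost or duplicated in this sketch; only the assignment of rows to buckets changes.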
Sorting
Optionally, a sorting expression can be supplied to re-sort the data during the rebalance.
The expression can be supplied in two ways:
- Via the ALTER TABLE COMPACT command:
ALTER TABLE <table> COMPACT 'REBALANCE' ORDER BY <column> ASC|DESC
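A rough model of sorting during a rebalance, assuming rows are dictionaries and the sort expression is a list of (column, direction) pairs. The helper name `rebalance_sorted` and the row layout are hypothetical; the real compactor works on ORC files, not Python lists.

```python
from operator import itemgetter


def rebalance_sorted(buckets, order_by, num_buckets=None):
    """Sort all rows by `order_by`, then split them evenly into buckets.

    order_by: list of (column, "ASC" | "DESC") pairs, most significant first.
    """
    target = len(buckets) if num_buckets is None else num_buckets
    if target < 1:
        raise ValueError("number of buckets must be at least 1")
    rows = [row for bucket in buckets for row in bucket]
    # Stable sorts applied from the least to the most significant key
    # give a multi-column ordering with mixed directions.
    for column, direction in reversed(order_by):
        if direction not in ("ASC", "DESC"):
            raise ValueError(f"unknown sort direction: {direction}")
        rows.sort(key=itemgetter(column), reverse=(direction == "DESC"))
    base, extra = divmod(len(rows), target)
    result, start = [], 0
    for i in range(target):
        size = base + (1 if i < extra else 0)
        result.append(rows[start:start + size])
        start += size
    return result


data = [
    [{"id": 1, "region": "us"}, {"id": 2, "region": "eu"}, {"id": 3, "region": "us"},
     {"id": 4, "region": "eu"}, {"id": 5, "region": "ap"}],
    [{"id": 6, "region": "eu"}],
]
sorted_buckets = rebalance_sorted(data, [("region", "ASC"), ("id", "DESC")])
print([[row["id"] for row in bucket] for bucket in sorted_buckets])
```

In this sketch the ordering is global: the first bucket holds the first slice of the sorted rows, the second bucket the next slice, and so on.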
Manual rebalance
The rebalance request can be created by using the ALTER TABLE COMPACT command (i.e. manual compaction).
Limitations
- Rebalancing can be done only within partitions.
- Rebalancing is not possible on explicitly bucketed (clustered) tables.
- Rebalancing is not possible via MR based compaction.
- Rebalancing is not supported on insert-only tables.
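The limitations above could be expressed as a pre-check before a rebalance request is accepted. The sketch below is hypothetical: `TableInfo`, its fields, and the engine strings are illustrative names, not Hive classes. The first limitation (rebalancing only within a partition) is a property of how rows are redistributed, so it is not modelled as a check here.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class TableInfo:
    """Hypothetical summary of the table properties that matter for rebalancing."""
    insert_only: bool
    explicitly_bucketed: bool


def rebalance_rejection_reason(table: TableInfo, compactor_engine: str) -> Optional[str]:
    """Return None if a rebalance is allowed, otherwise the reason it is rejected."""
    if compactor_engine == "MR":
        return "rebalancing is not possible via MR based compaction"
    if table.explicitly_bucketed:
        return "rebalancing is not possible on explicitly bucketed (clustered) tables"
    if table.insert_only:
        return "rebalancing is not supported on insert-only tables"
    return None


full_acid = TableInfo(insert_only=False, explicitly_bucketed=False)
print(rebalance_rejection_reason(full_acid, "QUERY"))
print(rebalance_rejection_reason(TableInfo(insert_only=True, explicitly_bucketed=False), "QUERY"))
```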
Implications
Compaction request (DB schema) changes
- A new compaction type (REBALANCE) must be added to the allowed compaction TYPES.
- A new optional field (and nullable DB column) is required to store the number of requested implicit buckets.
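As a sketch of what these two changes mean for the request model, the example below adds REBALANCE to an enumeration of compaction types and carries the requested bucket count as an optional (nullable) field. The class and field names are assumptions made for illustration; they are not the actual metastore schema or Thrift definitions.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class CompactionType(Enum):
    MINOR = "MINOR"
    MAJOR = "MAJOR"
    REBALANCE = "REBALANCE"  # the new type proposed here


@dataclass(frozen=True)
class CompactionRequest:
    """Hypothetical request record; field names are illustrative only."""
    db_name: str
    table_name: str
    type: CompactionType
    partition_name: Optional[str] = None
    number_of_buckets: Optional[int] = None  # maps to a nullable DB column

    def __post_init__(self):
        if self.number_of_buckets is None:
            return
        if self.type is not CompactionType.REBALANCE:
            raise ValueError("number_of_buckets is only meaningful for REBALANCE")
        if self.number_of_buckets < 1:
            raise ValueError("number_of_buckets must be at least 1")


request = CompactionRequest("default", "events", CompactionType.REBALANCE, number_of_buckets=4)
print(request)
```

Leaving `number_of_buckets` unset corresponds to the default described earlier, where the table keeps its current number of buckets.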
ALTER TABLE COMPACT changes
The ALTER TABLE COMPACT command must accept:
- the 'REBALANCE' compaction type,
- optionally, the new number of required buckets (... INTO {N} BUCKETS),
- optionally, the sorting expression (ORDER BY columnA ASC, columnB DESC).
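To show how the optional clauses combine, here is a small, hypothetical statement builder. It uses the clause spellings from this proposal (INTO {N} BUCKETS, ORDER BY) and places the table name before COMPACT, as in Hive's existing compaction statements; the grammar that was finally released may spell the clauses differently, so check the Hive DDL documentation before relying on the output. The function does not quote or escape identifiers.

```python
def build_rebalance_statement(table, num_buckets=None, order_by=None):
    """Assemble a REBALANCE compaction statement from its optional parts.

    order_by: optional list of (column, "ASC" | "DESC") pairs.
    """
    parts = [f"ALTER TABLE {table} COMPACT 'REBALANCE'"]
    if num_buckets is not None:
        if num_buckets < 1:
            raise ValueError("number of buckets must be at least 1")
        parts.append(f"INTO {num_buckets} BUCKETS")
    if order_by:
        columns = []
        for column, direction in order_by:
            if direction not in ("ASC", "DESC"):
                raise ValueError(f"unknown sort direction: {direction}")
            columns.append(f"{column} {direction}")
        parts.append("ORDER BY " + ", ".join(columns))
    return " ".join(parts)


plain = build_rebalance_statement("events")
full = build_rebalance_statement("events", num_buckets=4,
                                 order_by=[("columnA", "ASC"), ("columnB", "DESC")])
print(plain)
print(full)
```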
Compactor changes
Both the MR and query based compaction tasks must be enhanced with the ability to do a rebalancing compaction.
Query based compaction changes
New compactor implementations are required:
- Query based rebalance compactor for fully acid tables
MR based compaction changes
MR is deprecated; rebalancing compaction will be implemented for it only if it is really easy to do so.
Open points
Attachments
1. Query based Rebalance compaction on full acid tables | Closed | László Végh
2. Query based Rebalance compaction on insert-only tables | Resolved | László Végh
3. Enable initiator to schedule rebalancing compactions | Resolved | László Végh
4. Ability to set number of buckets manually | Closed | László Végh
5. Reorganize tests | Open | Unassigned
6. Ability to sort the data during rebalancing compaction | Resolved | László Végh
7. Revert HIVE-26717 and HIVE-26718 | Closed | László Végh