[HIVE-22702] ALTER TABLE REMOVE PARTITION is inefficient - ASF JIRA

XML

Word

Printable

JSON

Details

Type: Improvement
Status: Resolved
Priority: Major
Resolution: Duplicate
Affects Version/s: None
Fix Version/s: None
Component/s: Database/Schema
Labels:
None

Description

I recently realized the poor partitioning of a table of mine was becoming a major bottleneck and endeavored to reset the partitioning.

At this point, the table had about 56K partitions (year|month|day|city) combinations; moving to the more efficient year|month partitions means there's about 24.

In the process, I was having trouble fixing the registration of the table because of the size of its partition DB; I happened upon this SO Q&A which addresses the same issue:

https://stackoverflow.com/questions/50715939/drop-table-in-hive-via-spark-hangs/50814566#comment105440563_50814566

I set about batching through ALTER TABLE x DROP PARTITION (...), PARTITION (...) 200 at a time; it would run for about 2 hours to accomplish this, which strikes me as being quite inefficient.

(apologies that I haven't done a fully proper analysis of the scaling efficiency in this ticket)

If I were designing it from scratch, I would:

Keep the database of existing partitions sorted
Sort the incoming partitions to remove
Iterate via "shrinking binary search" (each partition is searched with binary search, and we can eliminate from the existing DB anything "less than" the current index when moving to the next iteration)

Is there something preventing this from being achieved?

Attachments

Issue Links

is duplicated by

HIVE-6980 Drop table by using direct sql

Closed

Activity

People

Assignee:: Unassigned

Reporter:: Michael Chirico

Votes:: 0 Vote for this issue

Watchers:: 3 Start watching this issue

Dates

Created:: 08/Jan/20 05:35

Updated:: 08/Jan/20 09:08

Resolved:: 08/Jan/20 09:08