Hive / HIVE-18098

Add support for Export/Import for Acid tables


Details

    • Type: New Feature
    • Status: Closed
    • Priority: Major
    • Resolution: Done
    • Affects Version/s: None
    • Fix Version/s: 3.0.0
    • Component/s: Transactions
    • Labels: None
    • Environment: n/a

    Description

      How should this work?
      For regular tables, export just copies the files under the table root to a specified directory.
      This doesn't make sense for Acid tables:

      • Some data may belong to aborted transactions
      • Transaction IDs are embedded into data file names. You'd have to export delta/ and base/, each of which may have files with the same names, e.g. bucket_00000.
      • On import these IDs won't make sense in a different cluster or even a different table (which may have its own delta_x_x for the same x, but with different data of course).
      • Export creates a _metadata file with column types, storage format, etc. Perhaps it can include info about aborted IDs (if the whole file can't be skipped).
      • Even importing into the same table on the same cluster may be a problem. For example, delta_5_5/ existed at the time of export and was included in it, but two days later it may no longer exist because it was compacted and cleaned.
      • If importing back into the same table on the same cluster, the data could be imported into a different transaction (assuming per table writeIDs) w/o having to remap the IDs in the rows themselves.
      • Support Import Overwrite?
      • Support Import as a new txn with remapping of ROW_IDs? The new writeID can be stored in a delta_x_x/meta_data and ROW_IDs can be remapped at read time (like isOriginal) and made permanent by compaction.
      • It doesn't seem reasonable to import acid data into a non-acid table.
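      The read-time ROW_ID remapping floated above can be pictured with a toy sketch. The (writeId, bucket, rowId) triple mirrors the real acid ROW_ID struct, but the tuple representation and the function itself are illustrative, not Hive code:

```python
def remap_row_ids(rows, new_write_id):
    """Replace the writeId component of each ROW_ID with the write id of the
    importing transaction, keeping bucket and rowId intact.  Doing this at
    read time mimics the isOriginal handling for pre-acid files; a later
    compaction would make the remapped ids permanent."""
    return [((new_write_id, bucket, row_num), data)
            for (_old_write_id, bucket, row_num), data in rows]
```

      For example, rows exported from a delta written under write id 7 would be surfaced under whatever write id the importing transaction was assigned.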

      Perhaps import can work similarly to Load Data: look at the imported file; if it has Acid columns, leave a note in delta_x_x/_meta_data to indicate that these columns should be skipped and new ROW_IDs assigned at read time.

      Case I

      Table has delta_7_7 and delta_8_8. Since both may have bucket_00000, we could export to export_dir and rename the files as bucket_00000 and bucket_00000_copy_1. Load Data supports an input dir with copy_N files.

      Case II

      What if we have delete_delta_9_9 in the source? Then you can't just ignore ROW_IDs after import.

      • Only export the latest base_N? Or, more generally, everything up to the smallest deleted ROW_ID (which may be hard to find without scanning all the deletes; the export would then have to be done under an X lock to prevent new concurrent deletes).
      • Stash all deletes in some additional file which on import gets added into the target delta/, so that the Acid reader can apply them to the data in this delta/ without clashing with 'normal' deletes that already exist in the table.
        • Here we may also have multiple delete_delta/ dirs with identical file names. Does the delete delta reader handle copy_N files?
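      The 'stashed deletes' option boils down to filtering this one delta's rows by a delete set that travels with it; a toy model (row ids and the function name are illustrative, not Hive's reader):

```python
def apply_stashed_deletes(rows, stashed_deletes):
    """rows: (row_id, data) pairs whose ROW_IDs are meaningful only within
    this one imported delta/.  stashed_deletes: the set of ROW_IDs exported
    from the source table's delete_delta_x_x/.  Because the delete set is
    scoped to the delta it arrived with, it cannot clash with deletes that
    already exist in the destination table."""
    return [(rid, data) for rid, data in rows if rid not in stashed_deletes]
```

      The key property is the scoping: the stashed set is consulted only while reading the imported delta, never against the rest of the table.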

    People

    Assignee: ekoifman Eugene Koifman
    Reporter: ekoifman Eugene Koifman
    Votes: 0
    Watchers: 3
