Hive / HIVE-18098

Add support for Export/Import for Acid tables


Details

    • Type: New Feature
    • Status: Closed
    • Priority: Major
    • Resolution: Done
    • Affects Version/s: None
    • Fix Version/s: 3.0.0
    • Component/s: Transactions
    • Labels: None
    • Environment: n/a

    Description

      How should this work?
      For regular tables, export just copies the files under the table root to a specified directory.
      This doesn't make sense for Acid tables:

      • Some data may belong to aborted transactions
      • Transaction IDs are embedded into data file names. You'd have to export delta/ and base/, each of which may have files with the same names, e.g. bucket_00000.
      • On import these IDs won't make sense in a different cluster or even a different table (which may have its own delta_x_x for the same x, but with different data of course).
      • Export creates a _metadata file with column types, storage format, etc. Perhaps it can include info about aborted IDs (if the whole file can't be skipped).
      • Even importing into the same table on the same cluster may be a problem. For example, delta_5_5/ existed at the time of export and was included in it, but two days later it may no longer exist because it was compacted and cleaned.
      • If importing back into the same table on the same cluster, the data could be imported into a different transaction (assuming per table writeIDs) w/o having to remap the IDs in the rows themselves.
      • Support Import Overwrite?
      • Support Import as a new txn with remapping of ROW_IDs? The new writeID can be stored in a delta_x_x/meta_data and ROW_IDs can be remapped at read time (like isOriginal) and made permanent by compaction.
      • It doesn't seem reasonable to import acid data into a non-acid table.
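      The read-time ROW_ID remapping floated above can be pictured with a toy sketch. The (writeId, bucket, rowId) triple mirrors the real acid ROW_ID struct, but the tuple representation and the function itself are illustrative, not Hive code:

```python
def remap_row_ids(rows, new_write_id):
    """Replace the writeId component of each ROW_ID with the write id of the
    importing transaction, keeping bucket and rowId intact.  Doing this at
    read time mimics the isOriginal handling for pre-acid files; a later
    compaction would make the remapped ids permanent."""
    return [((new_write_id, bucket, row_num), data)
            for (_old_write_id, bucket, row_num), data in rows]
```

      For example, rows exported from a delta written under write id 7 would be surfaced under whatever write id the importing transaction was assigned.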

      Perhaps import can work similarly to Load Data: look at the imported file; if it has Acid columns, leave a note in delta_x_x/_meta_data to indicate that these columns should be skipped and new ROW_IDs assigned at read time.

      Case I

      Table has delta_7_7 and delta_8_8. Since both may have bucket_00000, we could export to export_dir and rename the files as bucket_00000 and bucket_00000_copy_1. Load Data supports an input dir with copy_N files.

      Case II

      What if we have delete_delta_9_9 in the source? Then you can't just ignore ROW_IDs after import.

      • Only export the latest base_N? Or, more generally, everything up to the smallest deleted ROW_ID (which may be hard to find without scanning all the deletes; the export would then have to be done under an X lock to prevent new concurrent deletes).
      • Stash all deletes in some additional file which on import gets added into the target delta/, so that the Acid reader can apply them to the data in this delta/ without clashing with 'normal' deletes that already exist in the table.
        • Here we may also have multiple delete_delta/ dirs with identical file names. Does the delete delta reader handle copy_N files?
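      The 'stashed deletes' option boils down to filtering this one delta's rows by a delete set that travels with it; a toy model (row ids and the function name are illustrative, not Hive's reader):

```python
def apply_stashed_deletes(rows, stashed_deletes):
    """rows: (row_id, data) pairs whose ROW_IDs are meaningful only within
    this one imported delta/.  stashed_deletes: the set of ROW_IDs exported
    from the source table's delete_delta_x_x/.  Because the delete set is
    scoped to the delta it arrived with, it cannot clash with deletes that
    already exist in the destination table."""
    return [(rid, data) for rid, data in rows if rid not in stashed_deletes]
```

      The key property is the scoping: the stashed set is consulted only while reading the imported delta, never against the rest of the table.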

    People

    Assignee: ekoifman Eugene Koifman
    Reporter: ekoifman Eugene Koifman
    Votes: 0
    Watchers: 3
