  1. Hive
  2. HIVE-21530

Replicate streaming ingestion with transaction batch size greater than 1.

    Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 4.0.0
    • Fix Version/s: None
    • Component/s: repl, Transactions
    • Labels:
    • Target Version/s:

      Description

      Implement replication of Hive streaming ingest tables as described in "Hive ACID Replication_ Streaming Ingest Tables.pdf".
      Change txn_commit to include information about the transaction batch.
      Change the copy task to copy a file only if its size or checksum differs. This appears to be specific to transaction batches and should not be used for normal transactions.
      Copy the data file and its side file in the correct sequence.
      Remove side files (which appear to have a _flush suffix in their file names) when the batch is committed.
      How do we determine whether the events here are idempotent, i.e. that replaying an event updates the corresponding table and partition without copying a new version of the file?
      Validate that partially copied data files are handled correctly on the target warehouse, given a correct side file. Can the side file be left in place forever? If the primary warehouse fails during a transaction batch copy, after only some of the transactions have been copied over, we will not be able to remove the _flush file. Does failover have to handle this case?
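
The copy-only-on-difference check proposed above can be sketched as follows. This is a minimal illustration, not Hive's copy task: local files stand in for warehouse storage, SHA-256 stands in for whatever checksum the implementation would actually use, and the names `file_checksum`, `needs_copy`, and `copy_if_changed` are hypothetical.

```python
import hashlib
import shutil
from pathlib import Path


def file_checksum(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def needs_copy(source: Path, target: Path) -> bool:
    """True when the target is missing or differs in size or checksum."""
    if not target.exists():
        return True
    if source.stat().st_size != target.stat().st_size:
        return True  # cheap check first: appended data changes the size
    return file_checksum(source) != file_checksum(target)


def copy_if_changed(source: Path, target: Path) -> bool:
    """Copy source over target only when needed; return True if a copy ran."""
    if not needs_copy(source, target):
        return False
    target.parent.mkdir(parents=True, exist_ok=True)
    shutil.copyfile(source, target)
    return True
```

Under these assumptions, replaying the same copy is a no-op, which is one way to make the copy step itself idempotent: a second call on an unchanged file returns False, while a file that has grown because a later transaction in the batch was committed is copied again.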

            People

            • Assignee:
              pkumarsinha Pravin Sinha
              Reporter:
              sankarh Sankar Hariappan
