  1. Hive
  2. HIVE-20911

External Table Replication for Hive

Details

    Description

      External tables are currently not replicated as part of Hive replication. This JIRA enables that.

      Approach:

      • The target cluster will have a top-level base directory config that is used to copy all data relevant to external tables. It is provided via the with clause of the repl load command. This base path is prefixed to the path of the same external table on the source cluster. It can be set using the following configuration:
        hive.repl.replica.external.table.base.dir=/
        
      • Changes to an external table's directories can happen without Hive knowing about them, so we cannot capture the relevant events whenever data is added or removed. We therefore have to copy the data from the source path to the target path for external tables every time incremental replication runs.
        • This requires incremental repl dump to create an additional file, _external_tables_info, with one entry per line in the following form:
          tableName,base64Encoded(tableDataLocation)
          

          If different partitions of a table point to different locations, the file will contain multiple entries for the same table name, each with a different partition location. Partitions created without an explicit set location command reside within the table's own data location, so they do not get separate entries in the file.

        • repl load reads the _external_tables_info file to identify which locations are to be copied from source to target and creates corresponding tasks for them.
      • New external tables are created with metadata only; no data is copied as part of the regular tasks during incremental load or bootstrap load.
      • Bootstrap dump also creates _external_tables_info, which is used to copy data from source to target as part of bootstrap load.
      • Bootstrap load creates a DAG that can use parallelism in the execution phase; the HDFS copy tasks are created once the bootstrap phase is complete.
      • Since incremental load results in a DAG with only sequential execution (events applied in sequence), we create the HDFS copy tasks along with the incremental DAG to make effective use of the parallelism available in execution mode. This requires a few basic calculations to approximately meet the configured value of "hive.repl.approx.max.load.tasks".
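
The _external_tables_info format and the base-directory prefixing described above can be sketched roughly as follows. This is an illustration only, not the code from the attached patches: the function names are hypothetical, and it assumes that the scheme and authority of the source location (for example hdfs://host:port) are dropped before the base directory is prefixed, which the description does not state explicitly.

```python
import base64
import posixpath
from urllib.parse import urlparse


def encode_entry(table_name, location):
    """Build one line of the form tableName,base64Encoded(tableDataLocation)."""
    encoded = base64.b64encode(location.encode("utf-8")).decode("ascii")
    return f"{table_name},{encoded}"


def parse_entries(lines):
    """Decode lines back into (table_name, location) pairs, skipping blank lines."""
    entries = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        # The standard base64 alphabet contains no comma, so the first comma
        # separates the table name from the encoded location.
        table_name, encoded = line.split(",", 1)
        entries.append((table_name, base64.b64decode(encoded).decode("utf-8")))
    return entries


def target_path(base_dir, source_location):
    """Prefix base_dir to the path part of the source location.

    Assumption: only the path component is kept; scheme and authority
    of the source cluster are discarded.
    """
    source_path = urlparse(source_location).path
    return posixpath.join(base_dir, source_path.lstrip("/"))


if __name__ == "__main__":
    line = encode_entry("sales", "hdfs://nn1:8020/data/ext/sales")
    for table, location in parse_entries([line]):
        print(table, "->", target_path("/replica", location))
```

A table with partitions in several locations would simply contribute several lines with the same table name, and each decoded location would be mapped to its own target path.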

      Attachments

        1. HIVE-20911.01.patch
          138 kB
          Anishek Agarwal
        2. HIVE-20911.02.patch
          144 kB
          Anishek Agarwal
        3. HIVE-20911.03.patch
          148 kB
          Anishek Agarwal
        4. HIVE-20911.04.patch
          151 kB
          Anishek Agarwal
        5. HIVE-20911.05.patch
          150 kB
          Anishek Agarwal
        6. HIVE-20911.06.patch
          152 kB
          Anishek Agarwal
        7. HIVE-20911.07.patch
          159 kB
          Anishek Agarwal
        8. HIVE-20911.07.patch
          159 kB
          Anishek Agarwal
        9. HIVE-20911.08.patch
          169 kB
          Anishek Agarwal
        10. HIVE-20911.08.patch
          169 kB
          Anishek Agarwal
        11. HIVE-20911.09.patch
          176 kB
          Anishek Agarwal
        12. HIVE-20911.10.patch
          176 kB
          Anishek Agarwal
        13. HIVE-20911.11.patch
          176 kB
          Anishek Agarwal
        14. HIVE-20911.12.patch
          177 kB
          Anishek Agarwal
        15. HIVE-20911.12.patch
          177 kB
          Anishek Agarwal

        Activity

          People

            anishek Anishek Agarwal

              Time Tracking

                Estimated: Not Specified
                Remaining: 0h
                Logged: 20m