Apache Gobblin / GOBBLIN-1395

Spurious PostPublishStep WorkUnits with CREATE TABLE/ADD PARTITION for Hive table copy



    Description

      For Hive table incremental copy, Gobblin creates spurious PostPublishSteps for Hive registrations:

      1. It creates PostPublishSteps with CREATE TABLE, one per partition of the source table: for a table with P partitions in total, P such steps are created, irrespective of how many partitions actually need to be moved in the increment.
      2. It also creates PostPublishSteps with ADD PARTITION for partitions that are already present at the target, even though it creates no CopyableFile work units for those partitions.

      These spurious steps hurt performance: because the number of WorkUnits grows with the total partition count rather than with the size of the increment, the incremental data movement duration increases day by day.
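
      In terms of the observations below, for a run where O partitions already exist at the target and N new partitions are found at the source, the observed total WorkUnit count is

          W = 2 x (O + N) + N

      i.e. two PostPublishSteps for every partition of the entire table plus one CopyableFile per new partition, whereas only the N new partitions should need registration steps at all.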

      Steps to reproduce:

      Step 1)

      a) Create a table with 5 partitions, with some rows in each partition (a HiveQL sketch follows the output below):

      hive> show partitions tc_p5_r10;
       OK
       dt=2020-12-26
       dt=2020-12-27
       dt=2020-12-28
       dt=2020-12-29
       dt=2020-12-30
       Time taken: 1.287 seconds, Fetched: 5 row(s)
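
      A minimal HiveQL sketch of this setup; only the database/table name and the dt partition key come from this issue, while the column layout and row values are illustrative assumptions:

      hive> -- hypothetical columns; only tc_db.tc_p5_r10 and the dt key are from this issue
      hive> CREATE TABLE tc_db.tc_p5_r10 (id INT, val STRING) PARTITIONED BY (dt STRING);
      hive> -- two illustrative rows per partition
      hive> INSERT INTO TABLE tc_db.tc_p5_r10 PARTITION (dt='2020-12-26') VALUES (1, 'a'), (2, 'b');
      hive> INSERT INTO TABLE tc_db.tc_p5_r10 PARTITION (dt='2020-12-27') VALUES (1, 'a'), (2, 'b');
      hive> INSERT INTO TABLE tc_db.tc_p5_r10 PARTITION (dt='2020-12-28') VALUES (1, 'a'), (2, 'b');
      hive> INSERT INTO TABLE tc_db.tc_p5_r10 PARTITION (dt='2020-12-29') VALUES (1, 'a'), (2, 'b');
      hive> INSERT INTO TABLE tc_db.tc_p5_r10 PARTITION (dt='2020-12-30') VALUES (1, 'a'), (2, 'b');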
      

      b) Run the data movement with the job configuration listed below.
      c) Observations:

      Total No. of old partitions in the table (O): 0
      Total No. of new partitions in the table (N): 5
      Total WorkUnits created (W): 15 ( 2 x (O+N) + N )
      CopyableFile WorkUnits: 5 (one for each partition)
      PostPublishStep WorkUnits: 10 (two for each partition in the table: one publishing the table metadata, the other publishing the partition metadata)

      Step 2)

      a) Add 5 more partitions, with some rows in each partition (a sketch follows the note below):

      hive> show partitions tc_p5_r10;
       OK
       dt=2020-12-26
       dt=2020-12-27
       dt=2020-12-28
       dt=2020-12-29
       dt=2020-12-30
       dt=2021-01-01
       dt=2021-01-02
       dt=2021-01-03
       dt=2021-01-04
       dt=2021-01-05
       Time taken: 0.131 seconds, Fetched: 10 row(s)
      

       Note: the partition dt=2020-12-31 is missing; it has been intentionally left out for step 3.
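
      The additions can be sketched in the same hypothetical schema, deliberately skipping dt=2020-12-31:

      hive> INSERT INTO TABLE tc_db.tc_p5_r10 PARTITION (dt='2021-01-01') VALUES (1, 'a'), (2, 'b');
      hive> INSERT INTO TABLE tc_db.tc_p5_r10 PARTITION (dt='2021-01-02') VALUES (1, 'a'), (2, 'b');
      hive> INSERT INTO TABLE tc_db.tc_p5_r10 PARTITION (dt='2021-01-03') VALUES (1, 'a'), (2, 'b');
      hive> INSERT INTO TABLE tc_db.tc_p5_r10 PARTITION (dt='2021-01-04') VALUES (1, 'a'), (2, 'b');
      hive> INSERT INTO TABLE tc_db.tc_p5_r10 PARTITION (dt='2021-01-05') VALUES (1, 'a'), (2, 'b');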

      b) Run the data movement with the job configuration listed below.
      c) Observations:

      Total No. of old partitions in the table (O): 5
      Total No. of new partitions in the table (N): 5
      Total WorkUnits created (W): 25 ( 2 x (O+N) + N )
      CopyableFile WorkUnits: 5 (one for each newly found partition)
      PostPublishStep WorkUnits: 20 (two for every partition in the entire table, not just for new partitions!)

      Step 3)

      a) At the source, add the missing middle partition (dt=2020-12-31) with some rows in it (a sketch follows the output below):

      hive> show partitions tc_p5_r10;
       OK
       dt=2020-12-26
       dt=2020-12-27
       dt=2020-12-28
       dt=2020-12-29
       dt=2020-12-30
       dt=2020-12-31
       dt=2021-01-01
       dt=2021-01-02
       dt=2021-01-03
       dt=2021-01-04
       dt=2021-01-05
       Time taken: 0.101 seconds, Fetched: 11 row(s)
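
      In the same hypothetical schema, the back-fill is a single insert into the missing middle partition:

      hive> INSERT INTO TABLE tc_db.tc_p5_r10 PARTITION (dt='2020-12-31') VALUES (1, 'a'), (2, 'b');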
      

      b) Run the data movement with the job configuration listed below.
      c) Observations:

      Total No. of old partitions in the table (O): 10
      Total No. of new partitions in the table (N): 1
      Total WorkUnits created (W): 23 ( 2 x (O+N) + N )
      CopyableFile WorkUnits: 1 (for the single newly found partition)
      PostPublishStep WorkUnits: 22 (two for every partition in the entire table, not just for the new partition!)


      Job Configuration used:

      job.name=LocalHive2LocalHive-tc_db-tc_p5_r10-*
      job.description=Test Gobblin job for copy
      # target location for copy
      data.publisher.final.dir=/tmp/hive/tc_db_1_copy/tc_p5_r10/data
      gobblin.dataset.profile.class=org.apache.gobblin.data.management.copy.hive.HiveDatasetFinder
      source.filebased.fs.uri="hdfs://localhost:8020"
      hive.dataset.hive.metastore.uri=thrift://localhost:9083
      hive.dataset.copy.target.table.root=${data.publisher.final.dir}
      hive.dataset.copy.target.metastore.uri=thrift://localhost:9083
      hive.dataset.copy.target.database=tc_db_copy_1
      hive.db.root.dir=${data.publisher.final.dir}
      # writer.fs.uri="hdfs://127.0.0.1:8020/"
      hive.dataset.whitelist=tc_db.tc_p5_r10
      gobblin.copy.recursive.update=true
      # ====================================================================
      # Distcp configurations (do not change)
      # ====================================================================
      type=hadoopJava
      job.class=org.apache.gobblin.azkaban.AzkabanJobLauncher
      extract.namespace=org.apache.gobblin.copy
      data.publisher.type=org.apache.gobblin.data.management.copy.publisher.CopyDataPublisher
      source.class=org.apache.gobblin.data.management.copy.CopySource
      writer.builder.class=org.apache.gobblin.data.management.copy.writer.FileAwareInputStreamDataWriterBuilder
      converter.classes=org.apache.gobblin.converter.IdentityConverter
      task.maxretries=0
      workunit.retry.enabled=false
      distcp.persist.dir=/tmp/distcp-persist-dir
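
      To check which partitions actually landed at the target after each run, the copied table in the target database (tc_db_copy_1, per hive.dataset.copy.target.database above) can be inspected the same way as the source:

      hive> show partitions tc_db_copy_1.tc_p5_r10;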

      People

        Assignee: Abhishek Tiwari (abti)
        Reporter: Sridivakar (sridivakar)