Details
Type: Bug
Status: Open
Priority: Minor
Resolution: Unresolved
Affects Version/s: 0.15.0
Fix Version/s: None
Description
For Hive table incremental copy, Gobblin creates spurious PostPublishSteps for Hive registration:
- It creates many PostPublishSteps with CREATE TABLE. For a source table with P partitions in total, it creates P PostPublishSteps with CREATE TABLE, irrespective of how many partitions are moved in the increment.
- It also creates PostPublishSteps with ADD PARTITIONS for partitions that are already present at the target, even though it creates no CopyableFile work units for those partitions.
These spurious calls hurt performance. For instance, the duration of the incremental data movement increases day by day as the number of WorkUnits grows.
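The behaviour one would expect from the description above can be sketched as a set difference between source and target partitions. This is only an illustration in Python, not Gobblin code: the function name `partitions_needing_registration` and the sample partition lists are hypothetical, and the sketch assumes that a partition already present at the target needs no new registration step (it ignores partitions whose data changed at the source).

```python
def partitions_needing_registration(source, target):
    """Return partitions present at the source but missing at the target.

    Assumption for illustration: only these partitions should get
    Hive registration PostPublishSteps during an incremental copy.
    """
    return sorted(set(source) - set(target))


# Hypothetical partition lists, not taken from the reproduction steps.
source_partitions = ["dt=2021-01-01", "dt=2021-01-02", "dt=2021-01-03"]
target_partitions = ["dt=2021-01-01", "dt=2021-01-02"]

to_register = partitions_needing_registration(source_partitions, target_partitions)
print(to_register)  # ['dt=2021-01-03']
```

In the reported behaviour, registration steps are instead created for every partition in `source_partitions`, regardless of what is already at the target.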
Steps to reproduce :
Step 1)
a) create a table with 5 partitions, with some rows in each partition
hive> show partitions tc_p5_r10;
OK
dt=2020-12-26
dt=2020-12-27
dt=2020-12-28
dt=2020-12-29
dt=2020-12-30
Time taken: 1.287 seconds, Fetched: 5 row(s)
b) Do DataMovement with the below mentioned Job configuration
c) Observations :
Total No. of old partitions in the table (O) : 0
Total No. of new partitions in the table (N): 5
Total WorkUnits created (W): 15 ( 2 x (O+N) + N )
CopyableFile WorkUnits: 5 (one for each partition)
PostPublishStep WorkUnits: 10 (two for each partition in the table: one for publishing table metadata, the other for publishing partition metadata)
Step 2)
a) add 5 more partitions, with some rows in each partition
hive> show partitions tc_p5_r10;
OK
dt=2020-12-26
dt=2020-12-27
dt=2020-12-28
dt=2020-12-29
dt=2020-12-30
dt=2021-01-01
dt=2021-01-02
dt=2021-01-03
dt=2021-01-04
dt=2021-01-05
Time taken: 0.131 seconds, Fetched: 10 row(s)
Note: the partition for 31st Dec is intentionally left out here; it is added in Step 3.
b) Do DataMovement with the below Job configuration
c) Observations :
Total No. of old partitions in the table (O): 5
Total No. of new partitions in the table (N): 5
Total WorkUnits created (W) : 25 ( 2 x (O+N) + N )
CopyableFile WorkUnits: 5 (one for each newly found partition)
PostPublishStep WorkUnits: 20 (two for every partition in the entire table, not just for new partitions!)
Step 3)
a) At the source, add the missing partition (2020-12-31) in the middle, with some rows in the partition
hive> show partitions tc_p5_r10;
OK
dt=2020-12-26
dt=2020-12-27
dt=2020-12-28
dt=2020-12-29
dt=2020-12-30
dt=2020-12-31
dt=2021-01-01
dt=2021-01-02
dt=2021-01-03
dt=2021-01-04
dt=2021-01-05
Time taken: 0.101 seconds, Fetched: 11 row(s)
b) Do DataMovement with the below Job configuration
c) Observations :
Total No. of old partitions in the table (O): 10
Total No. of new partitions in the table (N): 1
Total WorkUnits created (W): 23 ( 2 x (O+N) + N )
CopyableFile WorkUnits: 1 (one for each newly found partition)
PostPublishStep WorkUnits: 22 (two for every partition in the entire table, not just for new partition!)
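The counts observed in the three steps all follow W = 2 x (O+N) + N. As a quick arithmetic check, the Python sketch below recomputes them. The names `WorkUnitCounts`, `observed_counts` and `needed_counts` are made up for this illustration. `needed_counts` is an assumption, not something stated by Gobblin: it supposes that only the N new partitions need registration, with the same two steps per partition.

```python
from typing import NamedTuple


class WorkUnitCounts(NamedTuple):
    copyable_file: int
    post_publish_step: int

    @property
    def total(self) -> int:
        return self.copyable_file + self.post_publish_step


def observed_counts(old: int, new: int) -> WorkUnitCounts:
    """Counts as reported in the observations: W = 2 x (O+N) + N."""
    return WorkUnitCounts(copyable_file=new, post_publish_step=2 * (old + new))


def needed_counts(new: int) -> WorkUnitCounts:
    """Assumed baseline: two registration steps per new partition only."""
    return WorkUnitCounts(copyable_file=new, post_publish_step=2 * new)


# (O, N) pairs from Step 1, Step 2 and Step 3.
runs = {"Step 1": (0, 5), "Step 2": (5, 5), "Step 3": (10, 1)}

for name, (old, new) in runs.items():
    seen = observed_counts(old, new)
    base = needed_counts(new)
    spurious = seen.post_publish_step - base.post_publish_step
    print(f"{name}: W={seen.total}, PostPublishSteps={seen.post_publish_step}, "
          f"spurious under the assumption={spurious}")
```

Under that assumption the spurious PostPublishSteps number 2 x O, so the overhead grows with the size of the table rather than with the size of the increment.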
Job Configuration used:
job.name=LocalHive2LocalHive-tc_db-tc_p5_r10-*
job.description=Test Gobblin job for copy
# target location for copy
data.publisher.final.dir=/tmp/hive/tc_db_1_copy/tc_p5_r10/data
gobblin.dataset.profile.class=org.apache.gobblin.data.management.copy.hive.HiveDatasetFinder
source.filebased.fs.uri="hdfs://localhost:8020"
hive.dataset.hive.metastore.uri=thrift://localhost:9083
hive.dataset.copy.target.table.root=${data.publisher.final.dir}
hive.dataset.copy.target.metastore.uri=thrift://localhost:9083
hive.dataset.copy.target.database=tc_db_copy_1
hive.db.root.dir=${data.publisher.final.dir}
# writer.fs.uri="hdfs://127.0.0.1:8020/"
hive.dataset.whitelist=tc_db.tc_p5_r10
gobblin.copy.recursive.update=true
# ====================================================================
# Distcp configurations (do not change)
# ====================================================================
type=hadoopJava
job.class=org.apache.gobblin.azkaban.AzkabanJobLauncher
extract.namespace=org.apache.gobblin.copy
data.publisher.type=org.apache.gobblin.data.management.copy.publisher.CopyDataPublisher
source.class=org.apache.gobblin.data.management.copy.CopySource
writer.builder.class=org.apache.gobblin.data.management.copy.writer.FileAwareInputStreamDataWriterBuilder
converter.classes=org.apache.gobblin.converter.IdentityConverter
task.maxretries=0
workunit.retry.enabled=false
distcp.persist.dir=/tmp/distcp-persist-dir