[GOBBLIN-1395] Spurious PostPublishStep WorkUnits with CREATE TABLE/ADD PARTITION for HIVE table copy - ASF JIRA

Details

Type: Bug
Status: Open
Priority: Minor
Resolution: Unresolved
Affects Version/s: 0.15.0
Fix Version/s: None
Component/s: hive-registration
Labels:
- Hive
- PayPal
- Performance

Description

For incremental Hive table copy, Gobblin creates spurious PostPublishSteps for Hive registration:

It creates many PostPublishSteps with CREATE TABLE. For a source table with P partitions in total, it creates P PostPublishSteps with CREATE TABLE, irrespective of how many partitions are to be moved in the increment.
It also creates PostPublishSteps with ADD PARTITION for partitions that are already present at the target, even though it creates no CopyableFile WorkUnits for those partitions.

These spurious steps hurt performance. For instance, the duration of incremental data movement increases day by day as the number of WorkUnits grows.
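
To illustrate the growth, here is a minimal Python sketch; it is not Gobblin code. It assumes the counts reported in the observations of this ticket (one CopyableFile WorkUnit per new partition, two PostPublishStep WorkUnits per partition in the whole table) and a hypothetical schedule of one new partition per daily run. The function name projected_work_units is made up for illustration.

```python
def projected_work_units(days, new_per_day=1):
    """Return one (day, copy_units, post_publish_units) tuple per incremental run.

    Assumption: each run adds `new_per_day` partitions and the table is never pruned.
    """
    rows = []
    total_partitions = 0
    for day in range(1, days + 1):
        total_partitions += new_per_day
        # One CopyableFile WorkUnit per newly added partition (as observed in this ticket).
        copy_units = new_per_day
        # Two PostPublishStep WorkUnits (CREATE TABLE + ADD PARTITION) per partition in the table.
        post_publish_units = 2 * total_partitions
        rows.append((day, copy_units, post_publish_units))
    return rows


if __name__ == "__main__":
    for day, copy_units, post_publish_units in projected_work_units(5):
        print(day, copy_units, post_publish_units)
```

Under these assumptions the copy work stays constant per run while the PostPublishStep count grows linearly with the table size.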

Steps to reproduce:

Step 1)

a) create a table with 5 partitions, with some rows in each partition

hive> show partitions tc_p5_r10;
 OK
 dt=2020-12-26
 dt=2020-12-27
 dt=2020-12-28
 dt=2020-12-29
 dt=2020-12-30
 Time taken: 1.287 seconds, Fetched: 5 row(s)

b) Run the data movement with the job configuration given below
c) Observations:

Total No. of old partitions in the table (O): 0
Total No. of new partitions in the table (N): 5
Total WorkUnits created (W): 15 ( 2 x (O+N) + N )
CopyableFile WorkUnits: 5 (one for each partition)
PostPublishStep WorkUnits: 10 (two for each partition in the table: one for publishing table metadata, the other for publishing partition metadata)

Step 2)

a) add 5 more partitions, with some rows in each partition

hive> show partitions tc_p5_r10;
 OK
 dt=2020-12-26
 dt=2020-12-27
 dt=2020-12-28
 dt=2020-12-29
 dt=2020-12-30
 dt=2021-01-01
 dt=2021-01-02
 dt=2021-01-03
 dt=2021-01-04
 dt=2021-01-05
 Time taken: 0.131 seconds, Fetched: 10 row(s)

Note: the partition for 31 Dec (dt=2020-12-31) is intentionally left out here; it is added in step 3.

b) Run the data movement with the job configuration given below
c) Observations:

Total No. of old partitions in the table (O): 5
Total No. of new partitions in the table (N): 5
Total WorkUnits created (W): 25 ( 2 x (O+N) + N )
CopyableFile WorkUnits: 5 (one for each newly found partition)
PostPublishStep WorkUnits: 20 (two for every partition in the entire table, not just for new partitions!)

Step 3)

a) At the source, add the missing partition (dt=2020-12-31) in the middle of the existing range, with some rows in the partition

hive> show partitions tc_p5_r10;
 OK
 dt=2020-12-26
 dt=2020-12-27
 dt=2020-12-28
 dt=2020-12-29
 dt=2020-12-30
 dt=2020-12-31
 dt=2021-01-01
 dt=2021-01-02
 dt=2021-01-03
 dt=2021-01-04
 dt=2021-01-05
 Time taken: 0.101 seconds, Fetched: 11 row(s)

b) Run the data movement with the job configuration given below
c) Observations:

Total No. of old partitions in the table (O): 10
Total No. of new partitions in the table (N): 1
Total WorkUnits created (W): 23 ( 2 x (O+N) + N )
CopyableFile WorkUnits: 1 (one for each newly found partition)
PostPublishStep WorkUnits: 22 (two for every partition in the entire table, not just for the new partition!)
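
The counts from the three steps can be cross-checked against the formula W = 2 x (O+N) + N with a small Python sketch. The function name observed_work_units and the post_publish_for_old field are illustrative only. Treating the 2 x O PostPublishSteps attributable to old partitions as the spurious ones is one reading of the observations above, not a confirmed statement about Gobblin internals.

```python
def observed_work_units(old, new):
    """Break down the WorkUnit counts reported in this ticket for O old and N new partitions."""
    copyable = new                      # one CopyableFile WorkUnit per new partition
    post_publish = 2 * (old + new)      # two PostPublishSteps per partition in the whole table
    return {
        "copyable": copyable,
        "post_publish": post_publish,
        "total": copyable + post_publish,   # W = 2 x (O+N) + N
        "post_publish_for_old": 2 * old,    # steps tied to partitions that were not copied
    }


if __name__ == "__main__":
    # (O, N) pairs taken from steps 1, 2 and 3 of this ticket.
    for step, (old, new) in enumerate([(0, 5), (5, 5), (10, 1)], start=1):
        print("Step", step, observed_work_units(old, new))
```

For step 3 this gives 23 WorkUnits in total, of which 20 PostPublishSteps relate to partitions that already exist at the target and only 3 WorkUnits relate to the single new partition.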

Job Configuration used:

job.name=LocalHive2LocalHive-tc_db-tc_p5_r10-*
job.description=Test Gobblin job for copy
# target location for copy
data.publisher.final.dir=/tmp/hive/tc_db_1_copy/tc_p5_r10/data
gobblin.dataset.profile.class=org.apache.gobblin.data.management.copy.hive.HiveDatasetFinder
source.filebased.fs.uri="hdfs://localhost:8020"
hive.dataset.hive.metastore.uri=thrift://localhost:9083
hive.dataset.copy.target.table.root=${data.publisher.final.dir}
hive.dataset.copy.target.metastore.uri=thrift://localhost:9083
hive.dataset.copy.target.database=tc_db_copy_1
hive.db.root.dir=${data.publisher.final.dir}
# writer.fs.uri="hdfs://127.0.0.1:8020/"
hive.dataset.whitelist=tc_db.tc_p5_r10
gobblin.copy.recursive.update=true
# ====================================================================
# Distcp configurations (do not change)
# ====================================================================
type=hadoopJava
job.class=org.apache.gobblin.azkaban.AzkabanJobLauncher
extract.namespace=org.apache.gobblin.copy
data.publisher.type=org.apache.gobblin.data.management.copy.publisher.CopyDataPublisher
source.class=org.apache.gobblin.data.management.copy.CopySource
writer.builder.class=org.apache.gobblin.data.management.copy.writer.FileAwareInputStreamDataWriterBuilder
converter.classes=org.apache.gobblin.converter.IdentityConverter
task.maxretries=0
workunit.retry.enabled=false
distcp.persist.dir=/tmp/distcp-persist-dir
