Details
Type: Bug
Status: Closed
Priority: Minor
Resolution: Not A Problem
Description
Hi Team,
I'm using the bulk insert option to load close to 2 TB of data. The process takes nearly 2 hours to complete. While looking at the job log, I noticed that the sortBy at GlobalSortPartitioner.java:41 runs twice.
The first time it is triggered in one stage; refer to the screenshot 1st.png.
The second time it is triggered from the "count at HoodieSparkSqlWriter.scala:433" step.
In both cases, the same number of jobs are triggered and the running times are close to each other. Refer to the screenshot 2nd.png.
Is there any way to run the sort only once so that the data can be loaded faster, or is this expected behaviour?
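For context, the pattern I seem to be hitting is the general Spark one: when two actions run on the same unpersisted RDD, the whole lineage, including the sort, is recomputed for each action. Below is a minimal sketch of that behaviour (this is not Hudi's internal code; the dataset, path, and object name are purely illustrative):

import org.apache.spark.sql.SparkSession

object RecomputeSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("recompute-sketch").master("local[*]").getOrCreate()

    // An unpersisted RDD with an expensive sort in its lineage.
    val sorted = spark.sparkContext
      .parallelize(1 to 1000000)
      .map(i => (i % 1000, i))
      .sortBy(_._1) // plays the role of the sortBy in GlobalSortPartitioner

    // Action 1: writing the data triggers the sort once.
    sorted.saveAsTextFile("/tmp/recompute-sketch-out")

    // Action 2: counting re-runs the full lineage, so the sort executes a second time
    // unless the RDD was persisted beforehand, e.g. sorted.persist(StorageLevel.MEMORY_AND_DISK).
    val n = sorted.count()
    println(s"rows: $n")

    spark.stop()
  }
}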
Spark and Hudi configurations
Spark - 2.3.0
Scala - 2.11.12
Hudi - 0.7.0
Hudi Configuration
"hoodie.cleaner.commits.retained" = 2 "hoodie.bulkinsert.shuffle.parallelism"=2000 "hoodie.parquet.small.file.limit" = 100000000 "hoodie.parquet.max.file.size" = 128000000 "hoodie.index.bloom.num_entries" = 1800000 "hoodie.bloom.index.filter.type" = "DYNAMIC_V0" "hoodie.bloom.index.filter.dynamic.max.entries" = 2500000 "hoodie.bloom.index.bucketized.checking" = "false" "hoodie.datasource.write.operation" = "bulk_insert" "hoodie.datasource.write.table.type" = "COPY_ON_WRITE"
Spark Configuration -
--num-executors 180
--executor-cores 4
--executor-memory 16g
--driver-memory=24g
--conf spark.rdd.compress=true
--queue=default
--conf spark.serializer=org.apache.spark.serializer.KryoSerializer
--conf spark.executor.memoryOverhead=1600
--conf spark.driver.memoryOverhead=1200
--conf spark.driver.maxResultSize=2g
--conf spark.kryoserializer.buffer.max=512m