Details
Type: Bug
Status: Closed
Priority: Minor
Resolution: Not A Problem
Description
Hi Team,
I'm using the bulk insert option to load close to 2 TB of data. The process takes nearly 2 hours to complete. While looking at the job log, I noticed that the sortBy at GlobalSortPartitioner.java:41 runs twice.
The first time it is triggered in one stage; refer to the screenshot 1st.png.
The second time it is triggered from the "count at HoodieSparkSqlWriter.scala:433" step.
In both cases, the same number of jobs are triggered and the running times are close to each other. Refer to the screenshot 2nd.png.
Is there any way to run the sort only once so that the data can be loaded faster, or is this expected behaviour?
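For context, the pattern I seem to be hitting is the general Spark one: when two actions run on the same unpersisted RDD, the whole lineage, including the sort, is recomputed for each action. Below is a minimal sketch of that behaviour (this is not Hudi's internal code; the dataset, path, and object name are purely illustrative):

import org.apache.spark.sql.SparkSession

object RecomputeSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("recompute-sketch").master("local[*]").getOrCreate()

    // An unpersisted RDD with an expensive sort in its lineage.
    val sorted = spark.sparkContext
      .parallelize(1 to 1000000)
      .map(i => (i % 1000, i))
      .sortBy(_._1) // plays the role of the sortBy in GlobalSortPartitioner

    // Action 1: writing the data triggers the sort once.
    sorted.saveAsTextFile("/tmp/recompute-sketch-out")

    // Action 2: counting re-runs the full lineage, so the sort executes a second time
    // unless the RDD was persisted beforehand, e.g. sorted.persist(StorageLevel.MEMORY_AND_DISK).
    val n = sorted.count()
    println(s"rows: $n")

    spark.stop()
  }
}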
Spark and Hudi configurations
Spark - 2.3.0
Scala - 2.11.12
Hudi - 0.7.0
Hudi Configuration
"hoodie.cleaner.commits.retained" = 2 "hoodie.bulkinsert.shuffle.parallelism"=2000 "hoodie.parquet.small.file.limit" = 100000000 "hoodie.parquet.max.file.size" = 128000000 "hoodie.index.bloom.num_entries" = 1800000 "hoodie.bloom.index.filter.type" = "DYNAMIC_V0" "hoodie.bloom.index.filter.dynamic.max.entries" = 2500000 "hoodie.bloom.index.bucketized.checking" = "false" "hoodie.datasource.write.operation" = "bulk_insert" "hoodie.datasource.write.table.type" = "COPY_ON_WRITE"
Spark Configuration -
--num-executors 180
--executor-cores 4
--executor-memory 16g
--driver-memory=24g
--conf spark.rdd.compress=true
--queue=default
--conf spark.serializer=org.apache.spark.serializer.KryoSerializer
--conf spark.executor.memoryOverhead=1600
--conf spark.driver.memoryOverhead=1200
--conf spark.driver.maxResultSize=2g
--conf spark.kryoserializer.buffer.max=512m