Apache Hudi / HUDI-1668

GlobalSortPartitioner is getting called twice during bulk_insert.


Details

    • Type: Bug
    • Status: Closed
    • Priority: Minor
    • Resolution: Not A Problem

    Description

      Hi Team,

      I'm using the bulk insert option to load close to 2 TB of data. The process takes nearly 2 hours to complete. While looking at the job log, I noticed that sortBy at GlobalSortPartitioner.java:41 is running twice.

      The first time, it is triggered as one stage; refer to this screenshot -> 1st.png.

      The second time, it is triggered from the count at HoodieSparkSqlWriter.scala:433 stage.

      In both cases, the same number of jobs are triggered and the running times are close to each other. Refer to this screenshot -> 2nd.png.

      Is there any way to run the sort only once so that the data can be loaded faster, or is this the expected behaviour?
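
      To illustrate what I mean, here is a minimal, self-contained Spark sketch (the names and sizes are made up for illustration; this is not the Hudi code): when a second action runs on an RDD that was never persisted, the work after the shuffle is executed a second time.

      import org.apache.spark.sql.SparkSession

      object SortRunsTwiceSketch {
        def main(args: Array[String]): Unit = {
          // Local master only so the sketch runs standalone.
          val spark = SparkSession.builder().appName("sort-runs-twice").master("local[2]").getOrCreate()

          // Toy stand-in for the records prepared for bulk_insert.
          val records = spark.sparkContext.parallelize(1 to 100000).map(i => (i % 100, i))

          // Stand-in for the global sort (a sortBy on the record key).
          val sorted = records.sortBy(_._1, numPartitions = 20)

          // Stand-in for the per-partition write step that yields one status per partition.
          val writeStatuses = sorted.mapPartitions(iter => Iterator(iter.size))

          // First action: runs the sort shuffle plus the write stage.
          writeStatuses.collect()

          // Second action: writeStatuses is not persisted, so the stage after the shuffle
          // is computed again for this job; writeStatuses.persist() before the two actions
          // would avoid the recomputation.
          println("statuses counted: " + writeStatuses.count())

          spark.stop()
        }
      }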

      Spark and Hudi configurations

       

      Spark - 2.3.0
      Scala - 2.11.12
      Hudi - 0.7.0

      Hudi Configuration

      "hoodie.cleaner.commits.retained" = 2  
      "hoodie.bulkinsert.shuffle.parallelism"=2000  
      "hoodie.parquet.small.file.limit" = 100000000  
      "hoodie.parquet.max.file.size" = 128000000  
      "hoodie.index.bloom.num_entries" = 1800000  
      "hoodie.bloom.index.filter.type" = "DYNAMIC_V0"  
      "hoodie.bloom.index.filter.dynamic.max.entries" = 2500000  
      "hoodie.bloom.index.bucketized.checking" = "false"  
      "hoodie.datasource.write.operation" = "bulk_insert"  
      "hoodie.datasource.write.table.type" = "COPY_ON_WRITE"

      Spark Configuration -

      --num-executors 180 
      --executor-cores 4 
      --executor-memory 16g 
      --driver-memory=24g 
      --conf spark.rdd.compress=true 
      --queue=default 
      --conf spark.serializer=org.apache.spark.serializer.KryoSerializer
      --conf spark.executor.memoryOverhead=1600 
      --conf spark.driver.memoryOverhead=1200 
      --conf spark.driver.maxResultSize=2g
      --conf spark.kryoserializer.buffer.max=512m 
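
      The same resources expressed as Spark properties (a sketch only, to map each flag to its property name; in practice these stay on the spark-submit command line, and driver memory in particular must be set before the driver JVM starts):

      import org.apache.spark.sql.SparkSession

      object SparkSubmitConfSketch {
        def main(args: Array[String]): Unit = {
          val spark = SparkSession.builder()
            .appName("hudi-bulk-insert")                // placeholder app name
            .config("spark.executor.instances", "180")  // --num-executors 180
            .config("spark.executor.cores", "4")        // --executor-cores 4
            .config("spark.executor.memory", "16g")     // --executor-memory 16g
            .config("spark.driver.memory", "24g")       // --driver-memory=24g
            .config("spark.yarn.queue", "default")      // --queue=default
            .config("spark.rdd.compress", "true")
            .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
            .config("spark.executor.memoryOverhead", "1600")
            .config("spark.driver.memoryOverhead", "1200")
            .config("spark.driver.maxResultSize", "2g")
            .config("spark.kryoserializer.buffer.max", "512m")
            .getOrCreate()

          // ... bulk_insert job body goes here ...
          spark.stop()
        }
      }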

      Attachments

        1. 1st.png
          423 kB
          Sugamber
        2. 2nd.png
          234 kB
          Sugamber
        3. Screen Shot 2021-04-17 at 11.23.17 AM.png
          561 kB
          sivabalan narayanan
        4. Screenshot 2021-04-21 at 6.40.19 PM.png
          430 kB
          Sugamber
        5. Screenshot 2021-04-21 at 6.40.40 PM.png
          245 kB
          Sugamber

          People

            Assignee: Nishith Agarwal (nishith29)
            Reporter: Sugamber (sugamberku)
            Votes: 0
            Watchers: 3
