Apache Hudi / HUDI-494

[DEBUGGING] Huge amount of tasks when writing files into HDFS


Details

    Description

      I am using a manual build of master after commit https://github.com/apache/incubator-hudi/commit/36b3b6f5dd913d3f1c9aa116aff8daf6540fed65. EDIT: I tried with the latest master but got the same result.

      I am seeing 3 million tasks when the Hudi Spark job writes the files into HDFS. It seems to be related to the input size: with 7.7 GB of input it was 3.2 million tasks, and with 9 GB of input it was 3.7 million. Both runs used a parallelism of 10.
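      For context, the write is set up roughly as below. This is a minimal sketch, not the actual job: the paths, table name, and field names are placeholders, and the parallelism options are the only settings that matter for this report.

{code:scala}
import org.apache.spark.sql.{SaveMode, SparkSession}

// Minimal sketch of the write path; paths, table name, and field names are placeholders.
val spark = SparkSession.builder().appName("hudi-write").getOrCreate()
val df = spark.read.parquet("hdfs:///tmp/source_data") // ~7.7-9 GB of input in the runs above

df.write
  .format("org.apache.hudi")
  .option("hoodie.table.name", "example_table")
  .option("hoodie.datasource.write.operation", "upsert")
  .option("hoodie.datasource.write.recordkey.field", "id")
  .option("hoodie.datasource.write.partitionpath.field", "dt")
  .option("hoodie.datasource.write.precombine.field", "ts")
  // Parallelism is set to 10, yet the stage writing to HDFS fans out to ~3 million tasks.
  .option("hoodie.insert.shuffle.parallelism", "10")
  .option("hoodie.upsert.shuffle.parallelism", "10")
  .mode(SaveMode.Append)
  .save("hdfs:///warehouse/example_table")
{code}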

      I am also seeing a huge number of 0-byte files being written into the .hoodie/.temp/ folder in my HDFS. In the Spark UI, each task writes fewer than 10 records in the "count at HoodieSparkSqlWriter" stage.

      All the stages before this seem normal. Any idea what happened here? My first guess would be something related to the bloom filter index. Maybe something triggers a repartitioning for the bloom filter index? But I am not really familiar with that part of the code.
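      If the bloom filter index lookup is indeed what blows up the task count, the knob I would guess at trying is the index parallelism. This is only a guess on my side; the option keys below are the standard Hudi index settings, and the value of 10 simply matches the job's parallelism.

{code:scala}
// Guess only: pin the bloom index parallelism explicitly (default 0 = auto-computed),
// passing these as extra options to the same df.write.format("org.apache.hudi") call.
val indexOpts = Map(
  "hoodie.index.type" -> "BLOOM",          // the default index type
  "hoodie.bloom.index.parallelism" -> "10" // match the job's parallelism of 10
)
{code}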

      Thanks

       

      Attachments

        1. Screen Shot 2020-01-02 at 8.53.44 PM.png
          122 kB
          Yanjia Gary Li
        2. Screen Shot 2020-01-02 at 8.53.24 PM.png
          163 kB
          Yanjia Gary Li
        3. image-2020-01-05-07-30-53-567.png
          119 kB
          lamber-ken
        4. example2_sparkui.png
          115 kB
          Yanjia Gary Li
        5. example2_hdfs.png
          131 kB
          Yanjia Gary Li


            People

              Assignee: garyli1019 Yanjia Gary Li
              Reporter: garyli1019 Yanjia Gary Li
              Votes: 0
              Watchers: 4

              Dates

                Created:
                Updated:
                Resolved: