  1. Apache Hudi
  2. HUDI-724

Parallelize GetSmallFiles For Partitions

Details

    Description

      When writing data, a gap was observed between Spark stages. Tracking down where the time was spent on the Spark driver showed that it went to the get-small-files operation for partitions.

      When the UpsertPartitioner is created and tries to assign insert records, it uses a plain for-loop to get the list of small files for every partition the load is going to write data to. This is very slow when there are many partitions to go through. While the operation runs on the Spark driver process, all worker nodes sit idle waiting for tasks.

      The partitions don't affect each other, so the get-small-files operations can be parallelized. The change I made passes the JavaSparkContext to the UpsertPartitioner, creates an RDD over the partitions, and thereby distributes the get-small-files operations across multiple tasks.
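
      The idea can be illustrated with a minimal sketch. It is not the actual patch: it uses a plain Java parallel stream as a stand-in for the Spark RDD so that it runs without a cluster, and the class name, the method names, and the getSmallFiles function are hypothetical. Small files are represented simply as file-ID strings.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;
import java.util.stream.Collectors;

// Illustrative only: SmallFilesSketch and its methods are hypothetical names,
// not classes or methods from the Hudi code base.
final class SmallFilesSketch {

    // Sequential lookup: one partition at a time on the calling thread,
    // mirroring the for-loop on the driver described in the issue.
    static Map<String, List<String>> lookupSequential(
            List<String> partitionPaths,
            Function<String, List<String>> getSmallFiles) {
        Map<String, List<String>> result = new HashMap<>();
        for (String partitionPath : partitionPaths) {
            result.put(partitionPath, getSmallFiles.apply(partitionPath));
        }
        return result;
    }

    // Parallel lookup: each partition is independent, so the lookups can run
    // concurrently. With Spark, the analogous shape would be roughly
    //   jsc.parallelize(partitionPaths, n).mapToPair(...).collectAsMap()
    // (an assumption about how the change might look, not the patch itself).
    // getSmallFiles must be thread-safe and must not return null.
    static Map<String, List<String>> lookupParallel(
            List<String> partitionPaths,
            Function<String, List<String>> getSmallFiles) {
        return partitionPaths.parallelStream()
                .distinct() // duplicate keys would make toConcurrentMap throw
                .collect(Collectors.toConcurrentMap(p -> p, getSmallFiles));
    }
}
```

      Both methods return the same mapping from partition path to small files; only the place where the work runs differs. In the Spark version the per-partition lookups become tasks on the executors instead of a loop on the driver.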

       

      Screenshots are attached showing the gap without the improvement and the Spark stage with the improvement (no gap).

      Attachments

        1. gap.png
          207 kB
          Feichi Feng
        2. nogapAfterImprovement.png
          94 kB
          Feichi Feng

            People

              Assignee: Unassigned
              Reporter: Feichi Feng
              Votes: 1
              Watchers: 4

              Dates

                Created:
                Updated:
                Resolved:

                Time Tracking

                  Estimated: 48h
                  Remaining: 47h 20m
                  Logged: 40m