Apache Hudi / HUDI-656

Write Performance - Driver spends too much time creating Parquet DataSource after writes


Details

    Description

      Problem Statement

      We have noticed this performance bottleneck on EMR, and it has also been reported at https://github.com/apache/incubator-hudi/issues/1371

      For writes through the DataSource API, Hudi creates a Spark relation: it uses HoodieSparkSqlWriter to write the DataFrame, and afterwards tries to return a relation by creating one through the parquet data source.

      While creating this parquet data source, Spark builds an InMemoryFileIndex, which lists the files under the base path. The listing itself is parallelized, but the filter we pass, HoodieROTablePathFilter, is applied sequentially on the driver to all of the thousands of files the listing returns. Spark does not parallelize this step, and it takes a long time, probably because of the filter's logic. As a result the driver spends its time just filtering. We have seen this step take 10-12 minutes for just 50 partitions in S3, and all of that time is spent after the write itself has finished.

      Solving this will significantly reduce write time across all kinds of writes. This time is essentially wasted, because we do not really have to return a relation after the write. Spark never uses this relation either way, and the write returns an empty set of rows.

      Proposed Solution

      The proposal is to return an empty Spark relation after the write, which eliminates the unnecessary time spent creating a parquet relation that never gets used.
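
      To illustrate the idea, here is a minimal sketch of what such an empty relation could look like using Spark's DataSource API. The names EmptyRelation and SketchSource are hypothetical and chosen only for this example; the write step is elided, and the sketch is not meant to be the actual patch.

```scala
import org.apache.spark.sql.{DataFrame, SaveMode, SQLContext}
import org.apache.spark.sql.sources.{BaseRelation, CreatableRelationProvider}
import org.apache.spark.sql.types.StructType

// Hypothetical name: a relation that only carries a schema.
// It does not extend any scan trait, so constructing it triggers
// no file listing and no path filtering on the driver.
class EmptyRelation(val sqlContext: SQLContext, val userSchema: StructType)
  extends BaseRelation {
  override def schema: StructType = userSchema
}

// Hypothetical provider showing where the empty relation would be returned.
class SketchSource extends CreatableRelationProvider {
  override def createRelation(
      sqlContext: SQLContext,
      mode: SaveMode,
      parameters: Map[String, String],
      data: DataFrame): BaseRelation = {
    // ... perform the actual write here (elided in this sketch) ...

    // Return a schema-only relation instead of building a parquet
    // data source over the base path.
    new EmptyRelation(sqlContext, data.schema)
  }
}
```

      Because the returned relation is built from the DataFrame's schema alone, the cost of returning it should not depend on the number of files or partitions in the table.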


            People

              uditme Udit Mehrotra

                Time Tracking

                  Original Estimate: Not Specified
                  Remaining Estimate: 0h
                  Time Spent: 20m