Details
- Type: Improvement
- Status: Closed
- Priority: Major
- Resolution: Fixed
Description
Problem Statement
We have observed this performance bottleneck at EMR, and it has also been reported here: https://github.com/apache/incubator-hudi/issues/1371
For writes through the DataSource API, Hudi uses HoodieSparkSqlWriter to write the dataframe and afterwards tries to return a relation by creating it through the parquet data source.
While creating this parquet data source, Spark builds an InMemoryFileIndex, which lists the files under the base path. The listing itself is parallelized, but the filter we pass, HoodieROTablePathFilter, is then applied sequentially on the driver to all of the thousands of files returned by the listing. Spark does not parallelize this step, and it takes a long time, likely because of the filter's logic, so the driver spends all of its time filtering. We have seen this take 10-12 minutes for just 50 partitions in S3, all of it after the write itself has finished.
Solving this will significantly reduce writing time across all types of writes. The time is essentially wasted, because we do not actually need to return a usable relation after the write: Spark never uses this relation, and the write path returns an empty set of rows.
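The driver-side bottleneck described above can be sketched as follows. This is a hypothetical, simplified model, not Spark's actual code: the `accept` method stands in for HoodieROTablePathFilter, and the loop models the serial filtering that InMemoryFileIndex performs on the driver after the parallel listing returns.

```java
import java.util.ArrayList;
import java.util.List;

public class SequentialFilterSketch {
    // Stand-in for HoodieROTablePathFilter.accept(path): keep only files
    // belonging to the latest commit (modeled here as a "_v2" suffix).
    static boolean accept(String path) {
        return path.endsWith("_v2.parquet");
    }

    public static void main(String[] args) {
        // Simulated result of the (parallelized) file listing of the base path.
        List<String> listed = new ArrayList<>();
        for (int part = 1; part <= 6; part++) {
            listed.add("s3://bucket/part=" + part + "/file_v1.parquet");
            listed.add("s3://bucket/part=" + part + "/file_v2.parquet");
        }
        // The O(n) driver-side loop: every listed file passes through accept()
        // one at a time; with thousands of files this is where the minutes go.
        List<String> visible = new ArrayList<>();
        for (String path : listed) {
            if (accept(path)) {
                visible.add(path);
            }
        }
        System.out.println(visible.size()); // prints 6
    }
}
```

With tens of thousands of files and a filter that does non-trivial work per path, this single-threaded loop dominates the post-write time.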
Proposed Solution
The proposal is to return an empty Spark relation after the write, which cuts out all of the unnecessary time spent creating a parquet relation that never gets used.
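A minimal sketch of that change, assuming a simplified `Relation` interface in place of Spark's `BaseRelation` (the names and shapes here are illustrative, not the actual Hudi or Spark API):

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

public class EmptyRelationSketch {
    // Simplified stand-in for Spark's BaseRelation: a schema plus a scan.
    interface Relation {
        List<String> schema();
        List<Object[]> scan();
    }

    // Instead of rebuilding a parquet relation after the write (which triggers
    // the costly file listing plus sequential path filtering), return a
    // relation that carries the written schema but produces no rows.
    static Relation emptyRelation(String... schemaFields) {
        List<String> schema = Arrays.asList(schemaFields);
        return new Relation() {
            public List<String> schema() { return schema; }
            public List<Object[]> scan() { return Collections.emptyList(); }
        };
    }

    public static void main(String[] args) {
        // Stand-in for the tail of the DataSource write path: the write has
        // already happened; we only need a relation Spark will never read.
        Relation rel = emptyRelation("key", "partition", "value");
        System.out.println(rel.scan().size()); // prints 0
    }
}
```

Since Spark discards the returned relation anyway, swapping in an empty one changes no observable behavior for callers while removing the listing-and-filter cost entirely.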
Attachments
Issue Links
- duplicates HUDI-672: Spark DataSource - Upsert for S3 Hudi dataset with large partitions takes a lot of time in writing (Closed)
- links to