Details
- Type: Improvement
- Status: Closed
- Priority: Major
- Resolution: Fixed
Description
Problem Statement
We have observed this performance bottleneck at EMR, and it has also been reported here: https://github.com/apache/incubator-hudi/issues/1371
For writes through the DataSource API, Hudi uses HoodieSparkSqlWriter to write the dataframe and afterwards tries to return a relation by creating it through the parquet data source.
While creating this parquet data source, Spark builds an InMemoryFileIndex, which lists the files under the base path. The listing itself is parallelized, but the filter we pass, HoodieROTablePathFilter, is then applied sequentially on the driver to all of the thousands of files returned by the listing. Spark does not parallelize this step, and it takes a long time, likely because of the filter's logic, so the driver spends all of its time filtering. We have seen this take 10-12 minutes for just 50 partitions in S3, all of it after the write itself has finished.
Solving this will significantly reduce writing time across all types of writes. The time is essentially wasted, because we do not actually need to return a usable relation after the write: Spark never uses this relation, and the write path returns an empty set of rows.
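The driver-side bottleneck described above can be sketched as follows. This is a hypothetical, simplified model, not Spark's actual code: the `accept` method stands in for HoodieROTablePathFilter, and the loop models the serial filtering that InMemoryFileIndex performs on the driver after the parallel listing returns.

```java
import java.util.ArrayList;
import java.util.List;

public class SequentialFilterSketch {
    // Stand-in for HoodieROTablePathFilter.accept(path): keep only files
    // belonging to the latest commit (modeled here as a "_v2" suffix).
    static boolean accept(String path) {
        return path.endsWith("_v2.parquet");
    }

    public static void main(String[] args) {
        // Simulated result of the (parallelized) file listing of the base path.
        List<String> listed = new ArrayList<>();
        for (int part = 1; part <= 6; part++) {
            listed.add("s3://bucket/part=" + part + "/file_v1.parquet");
            listed.add("s3://bucket/part=" + part + "/file_v2.parquet");
        }
        // The O(n) driver-side loop: every listed file passes through accept()
        // one at a time; with thousands of files this is where the minutes go.
        List<String> visible = new ArrayList<>();
        for (String path : listed) {
            if (accept(path)) {
                visible.add(path);
            }
        }
        System.out.println(visible.size()); // prints 6
    }
}
```

With tens of thousands of files and a filter that does non-trivial work per path, this single-threaded loop dominates the post-write time.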
Proposed Solution
The proposal is to return an empty Spark relation after the write, which cuts out all of the unnecessary time spent creating a parquet relation that never gets used.
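A minimal sketch of that change, assuming a simplified `Relation` interface in place of Spark's `BaseRelation` (the names and shapes here are illustrative, not the actual Hudi or Spark API):

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

public class EmptyRelationSketch {
    // Simplified stand-in for Spark's BaseRelation: a schema plus a scan.
    interface Relation {
        List<String> schema();
        List<Object[]> scan();
    }

    // Instead of rebuilding a parquet relation after the write (which triggers
    // the costly file listing plus sequential path filtering), return a
    // relation that carries the written schema but produces no rows.
    static Relation emptyRelation(String... schemaFields) {
        List<String> schema = Arrays.asList(schemaFields);
        return new Relation() {
            public List<String> schema() { return schema; }
            public List<Object[]> scan() { return Collections.emptyList(); }
        };
    }

    public static void main(String[] args) {
        // Stand-in for the tail of the DataSource write path: the write has
        // already happened; we only need a relation Spark will never read.
        Relation rel = emptyRelation("key", "partition", "value");
        System.out.println(rel.scan().size()); // prints 0
    }
}
```

Since Spark discards the returned relation anyway, swapping in an empty one changes no observable behavior for callers while removing the listing-and-filter cost entirely.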
Attachments
Issue Links
- duplicates HUDI-672: Spark DataSource - Upsert for S3 Hudi dataset with large partitions takes a lot of time in writing (Closed)
- links to