Apache Hudi / HUDI-494

[DEBUGGING] Huge amount of tasks when writing files into HDFS


Details

    Description

      I am using a manual build of master after commit https://github.com/apache/incubator-hudi/commit/36b3b6f5dd913d3f1c9aa116aff8daf6540fed65. EDIT: I tried with the latest master but got the same result.

      I am seeing 3 million tasks when the Hudi Spark job writes the files into HDFS. It seems to be related to the input size: with 7.7 GB of input it was 3.2 million tasks, and with 9 GB of input it was 3.7 million. Both runs used a parallelism of 10.
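      For context, the write is set up roughly as below. This is a minimal sketch, not the actual job: the paths, table name, and field names are placeholders, and the parallelism options are the only settings that matter for this report.

{code:scala}
import org.apache.spark.sql.{SaveMode, SparkSession}

// Minimal sketch of the write path; paths, table name, and field names are placeholders.
val spark = SparkSession.builder().appName("hudi-write").getOrCreate()
val df = spark.read.parquet("hdfs:///tmp/source_data") // ~7.7-9 GB of input in the runs above

df.write
  .format("org.apache.hudi")
  .option("hoodie.table.name", "example_table")
  .option("hoodie.datasource.write.operation", "upsert")
  .option("hoodie.datasource.write.recordkey.field", "id")
  .option("hoodie.datasource.write.partitionpath.field", "dt")
  .option("hoodie.datasource.write.precombine.field", "ts")
  // Parallelism is set to 10, yet the stage writing to HDFS fans out to ~3 million tasks.
  .option("hoodie.insert.shuffle.parallelism", "10")
  .option("hoodie.upsert.shuffle.parallelism", "10")
  .mode(SaveMode.Append)
  .save("hdfs:///warehouse/example_table")
{code}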

      I am also seeing a huge number of 0-byte files being written into the .hoodie/.temp/ folder in my HDFS. In the Spark UI, each task writes fewer than 10 records in the "count at HoodieSparkSqlWriter" stage.

      All the stages before this seem normal. Any idea what happened here? My first guess would be something related to the bloom filter index. Maybe something triggers a repartitioning for the bloom filter index? But I am not really familiar with that part of the code.
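      If the bloom filter index lookup is indeed what blows up the task count, the knob I would guess at trying is the index parallelism. This is only a guess on my side; the option keys below are the standard Hudi index settings, and the value of 10 simply matches the job's parallelism.

{code:scala}
// Guess only: pin the bloom index parallelism explicitly (default 0 = auto-computed),
// passing these as extra options to the same df.write.format("org.apache.hudi") call.
val indexOpts = Map(
  "hoodie.index.type" -> "BLOOM",          // the default index type
  "hoodie.bloom.index.parallelism" -> "10" // match the job's parallelism of 10
)
{code}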

      Thanks

       

      Attachments

        1. Screen Shot 2020-01-02 at 8.53.44 PM.png
          122 kB
          Yanjia Gary Li
        2. Screen Shot 2020-01-02 at 8.53.24 PM.png
          163 kB
          Yanjia Gary Li
        3. image-2020-01-05-07-30-53-567.png
          119 kB
          lamber-ken
        4. example2_sparkui.png
          115 kB
          Yanjia Gary Li
        5. example2_hdfs.png
          131 kB
          Yanjia Gary Li


            People

              Assignee: garyli1019 Yanjia Gary Li
              Reporter: garyli1019 Yanjia Gary Li
              Votes: 0
              Watchers: 4

              Dates

                Created:
                Updated:
                Resolved: