FLINK-15178

TaskExecutor crashes due to mmap allocation failure for BLOCKING shuffle


    Description

      I ran into this issue when running a test batch (DataSet) job with parallelism 1000.
      Some TaskManagers crash due to the error below:

      # There is insufficient memory for the Java Runtime Environment to continue.
      # Native memory allocation (mmap) failed to map 12288 bytes for committing reserved memory.
      [thread 139864559318784 also had an error]
      [thread 139867407243008 also had an error]
      

      The problem can be avoided with either of the following actions (a sketch of the first workaround is shown after this list):
      1. changing the ExecutionMode from BATCH_FORCED to PIPELINED
      2. changing the config "taskmanager.network.bounded-blocking-subpartition-type" from the default "auto" to "file"
      So it looks like the issue is related to the mmap-based BLOCKING shuffle.
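
      For reference, a minimal sketch of the first workaround as it would look in the job code (assuming the plan is built on a plain DataSet ExecutionEnvironment; the attached MultiRegionBatchNumberCount may set this up differently):

        import org.apache.flink.api.common.ExecutionMode;
        import org.apache.flink.api.java.ExecutionEnvironment;

        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

        // Workaround 1: run with PIPELINED instead of BATCH_FORCED so that not every
        // exchange is materialized as a BLOCKING (mmap-backed) result partition.
        env.getConfig().setExecutionMode(ExecutionMode.PIPELINED);

      The second workaround is a one-line change in flink-conf.yaml:

        taskmanager.network.bounded-blocking-subpartition-type: file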

      The issue is also a bit odd in that it always happens at the beginning of a job and disappears after several rounds of failover, so the job eventually succeeds.

      The job code and config file are attached.
      The command to run it (on a YARN cluster) is:

      bin/flink run -d -m yarn-cluster -c com.alibaba.blink.tests.MultiRegionBatchNumberCount ../flink-tests-1.0-SNAPSHOT-1.10.jar --parallelism 1000
      

      sewen pnowojski kevin.cyj Do you have any idea why this issue could happen?

      Attachments

        1. flink-conf.yaml
          12 kB
          Zhu Zhu
        2. MultiRegionBatchNumberCount.java
          4 kB
          Zhu Zhu

          People

            Assignee: Unassigned
            Reporter: Zhu Zhu