FLINK-15178

TaskExecutor crashes due to mmap allocation failure for BLOCKING shuffle


    Description

      I ran into this issue when running a test batch (DataSet) job with parallelism 1000.
      Some TaskManagers crash due to the error below:

      # There is insufficient memory for the Java Runtime Environment to continue.
      # Native memory allocation (mmap) failed to map 12288 bytes for committing reserved memory.
      [thread 139864559318784 also had an error]
      [thread 139867407243008 also had an error]
      

      The problem can be avoided with either of the following actions (a sketch of the first workaround is shown after this list):
      1. changing the ExecutionMode from BATCH_FORCED to PIPELINED
      2. changing the config "taskmanager.network.bounded-blocking-subpartition-type" from the default "auto" to "file"
      So it looks like the issue is related to the mmap-based BLOCKING shuffle.
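
      For reference, a minimal sketch of the first workaround as it would look in the job code (assuming the plan is built on a plain DataSet ExecutionEnvironment; the attached MultiRegionBatchNumberCount may set this up differently):

        import org.apache.flink.api.common.ExecutionMode;
        import org.apache.flink.api.java.ExecutionEnvironment;

        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

        // Workaround 1: run with PIPELINED instead of BATCH_FORCED so that not every
        // exchange is materialized as a BLOCKING (mmap-backed) result partition.
        env.getConfig().setExecutionMode(ExecutionMode.PIPELINED);

      The second workaround is a one-line change in flink-conf.yaml:

        taskmanager.network.bounded-blocking-subpartition-type: file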

      The issue is also a bit odd in that it always happens at the beginning of a job and disappears after several rounds of failover, so the job eventually succeeds.

      The job code and config file are attached.
      The command to run it (on a YARN cluster) is:

      bin/flink run -d -m yarn-cluster -c com.alibaba.blink.tests.MultiRegionBatchNumberCount ../flink-tests-1.0-SNAPSHOT-1.10.jar --parallelism 1000
      

      sewen pnowojski kevin.cyj Do you have any idea why this issue could happen?

      Attachments

        1. flink-conf.yaml
          12 kB
          Zhu Zhu
        2. MultiRegionBatchNumberCount.java
          4 kB
          Zhu Zhu

          People

            Assignee: Unassigned
            Reporter: Zhu Zhu