FLINK-13477

Containerized TaskManager killed because of lack of memory overhead


    Details

      Description

      Currently, the `-XX:MaxDirectMemorySize` parameter is set as:
      `MaxDirectMemorySize = containerMemoryMB - heapSizeMB`
      (see https://github.com/apache/flink/blob/7fec4392b21b07c69ba15ea554731886f181609e/flink-runtime/src/main/java/org/apache/flink/runtime/clusterframework/ContaineredTaskManagerParameters.java#L162)
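
As a rough illustration of the formula above, the following sketch computes the direct-memory cap from a container size and a heap size. The class name, method name, and the numbers in `main` are hypothetical and chosen only for illustration; this is not the code in `ContaineredTaskManagerParameters`.

```java
// Minimal sketch of the formula MaxDirectMemorySize = containerMemoryMB - heapSizeMB.
// All names and values here are illustrative assumptions, not Flink's implementation.
class DirectMemorySketch {

    /** Returns the direct-memory cap in MB implied by the formula in the description. */
    static long maxDirectMemoryMB(long containerMemoryMB, long heapSizeMB) {
        if (heapSizeMB < 0 || heapSizeMB > containerMemoryMB) {
            throw new IllegalArgumentException(
                "heap size must be between 0 and the container size");
        }
        return containerMemoryMB - heapSizeMB;
    }

    public static void main(String[] args) {
        long containerMemoryMB = 1024; // hypothetical container size
        long heapSizeMB = 600;         // hypothetical heap size
        long directMB = maxDirectMemoryMB(containerMemoryMB, heapSizeMB);
        System.out.println("-XX:MaxDirectMemorySize=" + directMB + "m");
    }
}
```

With these hypothetical inputs, the heap size and the direct-memory cap add up to exactly the container size.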

      However, as explained at
       https://docs.oracle.com/javase/8/docs/technotes/tools/unix/java.html,
      `MaxDirectMemorySize` only limits the total size of `java.nio`
      direct-buffer allocations. Other off-heap memory is not counted against
      it, so the total off-heap usage can exceed that value, and Mesos or Yarn
      then kills the container for exceeding its allocated memory.

      In addition, users might want to allocate off-heap memory through native
      code, in which case they will want to keep some of the container memory
      free and unallocated by Flink.
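
To make the problem concrete, the sketch below adds up a worst-case footprint: a full heap, direct buffers at their cap, and some other native memory. All names and numbers are hypothetical; `otherNativeMB` stands in for any off-heap usage that `MaxDirectMemorySize` does not limit, such as memory allocated by native code.

```java
// Minimal sketch of a container memory budget check.
// Names and values are illustrative assumptions, not Flink code.
class ContainerBudgetSketch {

    /** Worst-case footprint in MB if heap and direct buffers are both fully used. */
    static long worstCaseFootprintMB(long heapMB, long maxDirectMB, long otherNativeMB) {
        return heapMB + maxDirectMB + otherNativeMB;
    }

    /** True if the worst-case footprint is larger than the container allocation. */
    static boolean exceedsContainer(
            long containerMB, long heapMB, long maxDirectMB, long otherNativeMB) {
        return worstCaseFootprintMB(heapMB, maxDirectMB, otherNativeMB) > containerMB;
    }

    public static void main(String[] args) {
        long containerMB = 1024;  // hypothetical container size
        long heapMB = 600;        // hypothetical heap size
        long maxDirectMB = 424;   // containerMB - heapMB, as in the description

        // Heap plus direct cap already fills the container exactly.
        System.out.println(exceedsContainer(containerMB, heapMB, maxDirectMB, 0));
        // Any additional native memory pushes the worst case over the limit.
        System.out.println(exceedsContainer(containerMB, heapMB, maxDirectMB, 50));
    }
}
```

Under these assumptions the container has no headroom at all: once heap and direct buffers are fully used, any further native allocation exceeds the limit.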

      To work around this issue, we currently set the following parameter:

      -Dcontainerized.taskmanager.env.FLINK_ENV_JAVA_OPTS='-XX:MaxDirectMemorySize=600m'
      

      which overrides the value that Flink picks (744M in this case) with a lower one, keeping some overhead memory free in the TaskManager containers. However, this is an "ugly" hack: it circumvents the memory allocation that Flink performs and bypasses the sanity checks done in `ContaineredTaskManagerParameters`.
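
The effect of the override can be sketched as simple arithmetic: lowering the cap from the value Flink computes (744M) to 600m leaves the difference uncapped by Flink. The class and method names below are hypothetical. The sketch assumes the heap size is unchanged and that the freed amount must also cover the JVM's own non-heap usage.

```java
// Minimal sketch of the headroom gained by overriding MaxDirectMemorySize.
// Names are illustrative assumptions; 744 and 600 come from the description.
class OverrideHeadroomSketch {

    /** Memory in MB left outside heap and direct buffers after the override. */
    static long headroomMB(long flinkComputedDirectMB, long overrideDirectMB) {
        if (overrideDirectMB < 0 || overrideDirectMB > flinkComputedDirectMB) {
            throw new IllegalArgumentException(
                "override must be between 0 and the computed value");
        }
        return flinkComputedDirectMB - overrideDirectMB;
    }

    /** Builds the dynamic property shown in the description for a given cap. */
    static String javaOptsOverride(long overrideDirectMB) {
        return "-Dcontainerized.taskmanager.env.FLINK_ENV_JAVA_OPTS="
            + "'-XX:MaxDirectMemorySize=" + overrideDirectMB + "m'";
    }

    public static void main(String[] args) {
        System.out.println(javaOptsOverride(600));
        System.out.println("headroom MB: " + headroomMB(744, 600));
    }
}
```

With the values from this report, the override frees 144 MB of the container for memory that is neither heap nor direct buffers.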

              People

              • Assignee: Unassigned
              • Reporter: Benoit Hanotte (b.hanotte)