Flink / FLINK-5410

Running out of memory on Yarn


Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Later
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: Deployment / YARN
    • Labels: None

    Description

      I'm running locally with the following configuration (copied from the NodeManager logs):
      physical-memory=8192 virtual-memory=17204 virtual-cores=8

      Before starting a Flink deployment, the system reports 3.7 GB of memory in use, which should leave plenty of free memory for Flink containers.
      However, after I submit with minimal resource requirements (./yarn-session.sh -n 1 -tm 768), the cluster deploys successfully, but then every application on the system receives a SIGTERM. This kills the current user session and logs me out of the system.

      The JobManager and TaskManager logs only show that a SIGTERM was received and that the process shut down gracefully.
      The logs of all YARN and DFS processes also show the receipt of a SIGTERM.

      Here's the relevant log from the NodeManager:

      org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: Starting resource-monitoring for container_1483603191971_0002_01_000002
      2017-01-05 17:00:08,744 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: Memory usage of ProcessTree 17872 for container-id container_1483603191971_0002_01_000001: 282.7 MB of 1 GB physical memory used; 2.1 GB of 2.1 GB virtual memory used
      2017-01-05 17:00:08,744 WARN org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: Process tree for container: container_1483603191971_0002_01_000001 has processes older than 1 iteration running over the configured limit. Limit=2254857728, current usage = 2255896576
      2017-01-05 17:00:08,745 WARN org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: Container [pid=17872,containerID=container_1483603191971_0002_01_000001] is running beyond virtual memory limits. Current usage: 282.7 MB of 1 GB physical memory used; 2.1 GB of 2.1 GB virtual memory used. Killing container.
      Dump of the process-tree for container_1483603191971_0002_01_000001 :
      	|- PID PPID PGRPID SESSID CMD_NAME USER_MODE_TIME(MILLIS) SYSTEM_TIME(MILLIS) VMEM_USAGE(BYTES) RSSMEM_USAGE(PAGES) FULL_CMD_LINE
      	|- 17872 17870 17872 17872 (bash) 0 0 21409792 812 /bin/bash -c /usr/lib/jvm/java-8-openjdk-amd64//bin/java -Xmx424M  -Dlog.file=/opt/hadoop-2.7.3/logs/userlogs/application_1483603191971_0002/container_1483603191971_0002_01_000001/jobmanager.log -Dlogback.configurationFile=file:logback.xml -Dlog4j.configuration=file:log4j.properties org.apache.flink.yarn.YarnApplicationMasterRunner  1>/opt/hadoop-2.7.3/logs/userlogs/application_1483603191971_0002/container_1483603191971_0002_01_000001/jobmanager.out 2>/opt/hadoop-2.7.3/logs/userlogs/application_1483603191971_0002/container_1483603191971_0002_01_000001/jobmanager.err 
      	|- 17879 17872 17872 17872 (java) 748 20 2234486784 71553 /usr/lib/jvm/java-8-openjdk-amd64//bin/java -Xmx424M -Dlog.file=/opt/hadoop-2.7.3/logs/userlogs/application_1483603191971_0002/container_1483603191971_0002_01_000001/jobmanager.log -Dlogback.configurationFile=file:logback.xml -Dlog4j.configuration=file:log4j.properties org.apache.flink.yarn.YarnApplicationMasterRunner 
      
      2017-01-05 17:00:08,745 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: Removed ProcessTree with root 17872
      2017-01-05 17:00:08,746 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl: Container container_1483603191971_0002_01_000001 transitioned from RUNNING to KILLING
      2017-01-05 17:00:08,746 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch: Cleaning up container container_1483603191971_0002_01_000001
      2017-01-05 17:00:08,779 ERROR org.apache.hadoop.yarn.server.nodemanager.NodeManager: RECEIVED SIGNAL 15: SIGTERM
      2017-01-05 17:00:08,822 WARN org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor: Exit code from container container_1483603191971_0002_01_000001 is : 143
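
As a rough sanity check of the numbers in the log above, the following sketch adds up the VMEM_USAGE(BYTES) column of the process-tree dump and compares it with the logged limit. All constants are copied from the log; the helper names (`to_float32`, `summarize`) are made up for illustration. The last check assumes that the limit is the 1 GB container size multiplied by a virtual-to-physical ratio of 2.1 (the documented default of `yarn.nodemanager.vmem-pmem-ratio`), rounded to single precision. That assumption is consistent with the logged value but is not confirmed by anything in this report.

```python
import struct

GIB = 1024 ** 3

# Values copied from the NodeManager log above.
VMEM_LIMIT_BYTES = 2254857728   # "Limit=" in the WARN line
VMEM_USAGE_BYTES = 2255896576   # "current usage =" in the WARN line
PMEM_LIMIT_BYTES = 1 * GIB      # "1 GB physical memory"

# VMEM_USAGE(BYTES) column of the process-tree dump, keyed by PID.
PROCESS_VMEM_BYTES = {
    17872: 21409792,     # bash wrapper
    17879: 2234486784,   # JobManager JVM
}


def to_float32(value: float) -> float:
    """Round a Python float to the nearest IEEE 754 single-precision value."""
    return struct.unpack("f", struct.pack("f", value))[0]


def summarize() -> dict:
    """Derive a few figures from the logged numbers."""
    return {
        # Total virtual memory of the container's process tree.
        "tree_total_bytes": sum(PROCESS_VMEM_BYTES.values()),
        # How far the logged usage exceeds the logged limit.
        "overage_bytes": VMEM_USAGE_BYTES - VMEM_LIMIT_BYTES,
        # Ratio of the virtual limit to the physical limit.
        "implied_ratio": VMEM_LIMIT_BYTES / PMEM_LIMIT_BYTES,
        # Assumption: limit = physical limit * 2.1 in single precision.
        "limit_from_ratio": int(PMEM_LIMIT_BYTES * to_float32(2.1)),
    }


if __name__ == "__main__":
    for name, value in summarize().items():
        print(f"{name}: {value}")
```

With these inputs, the process-tree total equals the logged usage of 2255896576 bytes, and the container exceeds its virtual memory limit by 1038848 bytes (just under 1 MiB), while physical usage stays at 282.7 MB of 1 GB.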
      

      Is the memory available on my PC insufficient, or are there any known issues that might lead to this?

      Also, this doesn't occur every time I start a Flink session.

          People

            Assignee: Unassigned
            Reporter: Sachin Goel (sachingoel0101)