  1. Flink
  2. FLINK-4485

Finished jobs in yarn session fill /tmp filesystem

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Blocker
    • Resolution: Fixed
    • Affects Version/s: 1.1.0
    • Fix Version/s: 1.2.0, 1.1.3
    • Component/s: JobManager
    • Labels:
      None

      Description

      On a Yarn cluster I start a yarn-session with a few containers and task slots.
      Then I fire a 'large' number of Flink batch jobs in sequence against this yarn session. It is the exact same job (java code) yet it gets different parameters.

      In this scenario it is exporting HBase tables to files in HDFS and the parameters are about which data from which tables and the name of the target directory.

      After running several dozen jobs, job submission started to fail and we investigated.

      We found that the cause was that on the Yarn node which was hosting the jobmanager the /tmp file system was full (4GB was 100% full).

      However, the output of du -hcs /tmp showed only 200MB in use.

      We found that a very large file (we guess it is the jar of the job) was put in /tmp, used, and deleted, yet the file handle was not closed by the jobmanager.

      As soon as we killed the jobmanager the disk space was freed.

      The summary of the impact of this is that a yarn-session that receives enough jobs brings down the Yarn node for all users.
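The gap between du and the full filesystem is consistent with how POSIX filesystems treat a file that is deleted while a process still holds it open: the directory entry disappears, so du no longer counts it, but the blocks are only freed once the last open handle is closed. A minimal Java sketch of that behavior follows, assuming a Linux-like filesystem. The class and method names are made up for illustration and none of this is Flink code.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;

// Illustration only: hypothetical names, not taken from Flink.
class DeletedButOpenDemo {

    /**
     * Writes a temp file, opens it, deletes it, and then reads it through the
     * stream that is still open. Returns the number of bytes read after the delete.
     */
    static int readAfterDelete(byte[] payload) {
        try {
            Path file = Files.createTempFile("blob-demo-", ".tmp");
            Files.write(file, payload);

            try (InputStream in = Files.newInputStream(file)) {
                // Remove the directory entry while the stream is still open.
                Files.delete(file);
                if (Files.exists(file)) {
                    throw new IllegalStateException("file should no longer be visible");
                }

                // The data is still readable, so the blocks are still allocated.
                int total = 0;
                byte[] buffer = new byte[8192];
                int n;
                while ((n = in.read(buffer)) != -1) {
                    total += n;
                }
                return total;
            } // Closing the stream here is what finally lets the OS reclaim the space.
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        int bytes = readAfterDelete(new byte[1024]);
        System.out.println("Bytes read after delete: " + bytes);
    }
}
```

While the stream is open, lsof would list the temp file with a "(deleted)" suffix, which is the pattern visible in the output in this ticket.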

      See parts of the output we got from lsof below.

      COMMAND     PID      USER   FD      TYPE             DEVICE      SIZE       NODE NAME
      java      15034   nbasjes  550r      REG             253,17  66219695        245 /tmp/blobStore-fbe9c4cf-1f85-48cb-aad9-180e8d4ec7ce/incoming/temp-00000003 (deleted)
      java      15034   nbasjes  551r      REG             253,17  66219695        252 /tmp/blobStore-fbe9c4cf-1f85-48cb-aad9-180e8d4ec7ce/incoming/temp-00000007 (deleted)
      java      15034   nbasjes  552r      REG             253,17  66219695        267 /tmp/blobStore-fbe9c4cf-1f85-48cb-aad9-180e8d4ec7ce/incoming/temp-00000012 (deleted)
      java      15034   nbasjes  553r      REG             253,17  66219695        250 /tmp/blobStore-fbe9c4cf-1f85-48cb-aad9-180e8d4ec7ce/incoming/temp-00000005 (deleted)
      java      15034   nbasjes  554r      REG             253,17  66219695        288 /tmp/blobStore-fbe9c4cf-1f85-48cb-aad9-180e8d4ec7ce/incoming/temp-00000018 (deleted)
      java      15034   nbasjes  555r      REG             253,17  66219695        298 /tmp/blobStore-fbe9c4cf-1f85-48cb-aad9-180e8d4ec7ce/incoming/temp-00000025 (deleted)
      java      15034   nbasjes  557r      REG             253,17  66219695        254 /tmp/blobStore-fbe9c4cf-1f85-48cb-aad9-180e8d4ec7ce/incoming/temp-00000008 (deleted)
      java      15034   nbasjes  558r      REG             253,17  66219695        292 /tmp/blobStore-fbe9c4cf-1f85-48cb-aad9-180e8d4ec7ce/incoming/temp-00000019 (deleted)
      java      15034   nbasjes  559r      REG             253,17  66219695        275 /tmp/blobStore-fbe9c4cf-1f85-48cb-aad9-180e8d4ec7ce/incoming/temp-00000013 (deleted)
      java      15034   nbasjes  560r      REG             253,17  66219695        159 /tmp/blobStore-fbe9c4cf-1f85-48cb-aad9-180e8d4ec7ce/incoming/temp-00000002 (deleted)
      java      15034   nbasjes  562r      REG             253,17  66219695        238 /tmp/blobStore-fbe9c4cf-1f85-48cb-aad9-180e8d4ec7ce/incoming/temp-00000001 (deleted)
      java      15034   nbasjes  568r      REG             253,17  66219695        246 /tmp/blobStore-fbe9c4cf-1f85-48cb-aad9-180e8d4ec7ce/incoming/temp-00000004 (deleted)
      java      15034   nbasjes  569r      REG             253,17  66219695        255 /tmp/blobStore-fbe9c4cf-1f85-48cb-aad9-180e8d4ec7ce/incoming/temp-00000009 (deleted)
      java      15034   nbasjes  571r      REG             253,17  66219695        299 /tmp/blobStore-fbe9c4cf-1f85-48cb-aad9-180e8d4ec7ce/incoming/temp-00000026 (deleted)
      java      15034   nbasjes  572r      REG             253,17  66219695        293 /tmp/blobStore-fbe9c4cf-1f85-48cb-aad9-180e8d4ec7ce/incoming/temp-00000020 (deleted)
      java      15034   nbasjes  574r      REG             253,17  66219695        256 /tmp/blobStore-fbe9c4cf-1f85-48cb-aad9-180e8d4ec7ce/incoming/temp-00000010 (deleted)
      java      15034   nbasjes  575r      REG             253,17  66219695        302 /tmp/blobStore-fbe9c4cf-1f85-48cb-aad9-180e8d4ec7ce/incoming/temp-00000029 (deleted)
      java      15034   nbasjes  576r      REG             253,17  66219695        294 /tmp/blobStore-fbe9c4cf-1f85-48cb-aad9-180e8d4ec7ce/incoming/temp-00000021 (deleted)
      java      15034   nbasjes  577r      REG             253,17  66219695        262 /tmp/blobStore-fbe9c4cf-1f85-48cb-aad9-180e8d4ec7ce/incoming/temp-00000011 (deleted)
      java      15034   nbasjes  578r      REG             253,17  66219695        251 /tmp/blobStore-fbe9c4cf-1f85-48cb-aad9-180e8d4ec7ce/incoming/temp-00000006 (deleted)
      java      15034   nbasjes  580r      REG             253,17  66219695        295 /tmp/blobStore-fbe9c4cf-1f85-48cb-aad9-180e8d4ec7ce/incoming/temp-00000022 (deleted)
      java      15034   nbasjes  581r      REG             253,17  66219695        300 /tmp/blobStore-fbe9c4cf-1f85-48cb-aad9-180e8d4ec7ce/incoming/temp-00000027 (deleted)
      java      15034   nbasjes  582r      REG             253,17  66219695        188 /tmp/blobStore-fbe9c4cf-1f85-48cb-aad9-180e8d4ec7ce/cache/blob_e318d1698aa6e7dc91e5f4a9f8ba29781aebd8c4 (deleted)
      java      15034   nbasjes  585r      REG             253,17  66219695        279 /tmp/blobStore-fbe9c4cf-1f85-48cb-aad9-180e8d4ec7ce/incoming/temp-00000014 (deleted)
      java      15034   nbasjes  586r      REG             253,17  66219695        296 /tmp/blobStore-fbe9c4cf-1f85-48cb-aad9-180e8d4ec7ce/incoming/temp-00000023 (deleted)
      java      15034   nbasjes  588r      REG             253,17  66219695        301 /tmp/blobStore-fbe9c4cf-1f85-48cb-aad9-180e8d4ec7ce/incoming/temp-00000028 (deleted)
      java      15034   nbasjes  589r      REG             253,17  66219695        297 /tmp/blobStore-fbe9c4cf-1f85-48cb-aad9-180e8d4ec7ce/incoming/temp-00000024 (deleted)
      java      15034   nbasjes  598r      REG             253,17  66219695        280 /tmp/blobStore-fbe9c4cf-1f85-48cb-aad9-180e8d4ec7ce/incoming/temp-00000015 (deleted)
      java      15034   nbasjes  601r      REG             253,17  66219695        289 /tmp/blobStore-fbe9c4cf-1f85-48cb-aad9-180e8d4ec7ce/incoming/temp-00000016 (deleted)
      java      15034   nbasjes  604r      REG             253,17  66219695        284 /tmp/blobStore-fbe9c4cf-1f85-48cb-aad9-180e8d4ec7ce/incoming/temp-00000017 (deleted)
      

          Activity

          nielsbasjes Niels Basjes added a comment -

          I just reproduced the effect on a non-secure Yarn cluster.
          After having run a few jobs I see this on the node where the jobmanager runs:

          [root@node1 ~]# lsof | fgrep '/tmp/blobStore'
          java      15358          yarn  mem       REG                8,3  70243224   25936270 /tmp/blobStore-0864a537-f6fa-4b27-9b7f-8cb5a3722c3e/cache/blob_501262b25ff9158ff07ee1f4264b5e3afeaaf69f
          java      15358          yarn  DEL       REG                8,3             25936269 /tmp/blobStore-0864a537-f6fa-4b27-9b7f-8cb5a3722c3e/incoming/temp-00000027
          java      15358          yarn  DEL       REG                8,3             25936268 /tmp/blobStore-0864a537-f6fa-4b27-9b7f-8cb5a3722c3e/incoming/temp-00000026
          java      15358          yarn  DEL       REG                8,3             25936267 /tmp/blobStore-0864a537-f6fa-4b27-9b7f-8cb5a3722c3e/incoming/temp-00000025
          java      15358          yarn  DEL       REG                8,3             25936266 /tmp/blobStore-0864a537-f6fa-4b27-9b7f-8cb5a3722c3e/incoming/temp-00000024
          java      15358          yarn  DEL       REG                8,3             25936265 /tmp/blobStore-0864a537-f6fa-4b27-9b7f-8cb5a3722c3e/incoming/temp-00000023
          java      15358          yarn  DEL       REG                8,3             25936264 /tmp/blobStore-0864a537-f6fa-4b27-9b7f-8cb5a3722c3e/incoming/temp-00000022
          java      15358          yarn  DEL       REG                8,3             25936263 /tmp/blobStore-0864a537-f6fa-4b27-9b7f-8cb5a3722c3e/incoming/temp-00000021
          java      15358          yarn  DEL       REG                8,3             25936258 /tmp/blobStore-0864a537-f6fa-4b27-9b7f-8cb5a3722c3e/incoming/temp-00000020
          java      15358          yarn  DEL       REG                8,3             25936257 /tmp/blobStore-0864a537-f6fa-4b27-9b7f-8cb5a3722c3e/incoming/temp-00000019
          java      15358          yarn  DEL       REG                8,3             25936260 /tmp/blobStore-0864a537-f6fa-4b27-9b7f-8cb5a3722c3e/incoming/temp-00000018
          java      15358          yarn  DEL       REG                8,3             25936259 /tmp/blobStore-0864a537-f6fa-4b27-9b7f-8cb5a3722c3e/incoming/temp-00000017
          java      15358          yarn  DEL       REG                8,3             25936256 /tmp/blobStore-0864a537-f6fa-4b27-9b7f-8cb5a3722c3e/incoming/temp-00000016
          java      15358          yarn  DEL       REG                8,3             25936255 /tmp/blobStore-0864a537-f6fa-4b27-9b7f-8cb5a3722c3e/incoming/temp-00000015
          java      15358          yarn  DEL       REG                8,3             25936254 /tmp/blobStore-0864a537-f6fa-4b27-9b7f-8cb5a3722c3e/incoming/temp-00000014
          java      15358          yarn  DEL       REG                8,3             25936253 /tmp/blobStore-0864a537-f6fa-4b27-9b7f-8cb5a3722c3e/incoming/temp-00000013
          java      15358          yarn  DEL       REG                8,3             25936252 /tmp/blobStore-0864a537-f6fa-4b27-9b7f-8cb5a3722c3e/incoming/temp-00000012
          java      15358          yarn  DEL       REG                8,3             25936251 /tmp/blobStore-0864a537-f6fa-4b27-9b7f-8cb5a3722c3e/incoming/temp-00000011
          java      15358          yarn  DEL       REG                8,3             25936250 /tmp/blobStore-0864a537-f6fa-4b27-9b7f-8cb5a3722c3e/incoming/temp-00000010
          java      15358          yarn  DEL       REG                8,3             25936249 /tmp/blobStore-0864a537-f6fa-4b27-9b7f-8cb5a3722c3e/incoming/temp-00000009
          java      15358          yarn  DEL       REG                8,3             25936248 /tmp/blobStore-0864a537-f6fa-4b27-9b7f-8cb5a3722c3e/incoming/temp-00000008
          java      15358          yarn  DEL       REG                8,3             25936247 /tmp/blobStore-0864a537-f6fa-4b27-9b7f-8cb5a3722c3e/incoming/temp-00000007
          java      15358          yarn  DEL       REG                8,3             25936246 /tmp/blobStore-0864a537-f6fa-4b27-9b7f-8cb5a3722c3e/incoming/temp-00000006
          java      15358          yarn  DEL       REG                8,3             25936244 /tmp/blobStore-0864a537-f6fa-4b27-9b7f-8cb5a3722c3e/incoming/temp-00000005
          java      15358          yarn  DEL       REG                8,3             25936222 /tmp/blobStore-0864a537-f6fa-4b27-9b7f-8cb5a3722c3e/incoming/temp-00000004
          java      15358          yarn  DEL       REG                8,3             25936221 /tmp/blobStore-0864a537-f6fa-4b27-9b7f-8cb5a3722c3e/incoming/temp-00000003
          java      15358          yarn  DEL       REG                8,3             25936220 /tmp/blobStore-0864a537-f6fa-4b27-9b7f-8cb5a3722c3e/incoming/temp-00000002
          java      15358          yarn  DEL       REG                8,3             25936215 /tmp/blobStore-0864a537-f6fa-4b27-9b7f-8cb5a3722c3e/incoming/temp-00000001
          java      15358          yarn  422r      REG                8,3  70243224   25936222 /tmp/blobStore-0864a537-f6fa-4b27-9b7f-8cb5a3722c3e/incoming/temp-00000004 (deleted)
          java      15358          yarn  581u      REG                8,3  70243224   25936265 /tmp/blobStore-0864a537-f6fa-4b27-9b7f-8cb5a3722c3e/incoming/temp-00000023 (deleted)
          java      15358          yarn  582u      REG                8,3  70243224   25936267 /tmp/blobStore-0864a537-f6fa-4b27-9b7f-8cb5a3722c3e/incoming/temp-00000025 (deleted)
          java      15358          yarn  583r      REG                8,3  70243224   25936246 /tmp/blobStore-0864a537-f6fa-4b27-9b7f-8cb5a3722c3e/incoming/temp-00000006 (deleted)
          java      15358          yarn  584r      REG                8,3  70243224   25936215 /tmp/blobStore-0864a537-f6fa-4b27-9b7f-8cb5a3722c3e/incoming/temp-00000001 (deleted)
          java      15358          yarn  590u      REG                8,3  70243224   25936266 /tmp/blobStore-0864a537-f6fa-4b27-9b7f-8cb5a3722c3e/incoming/temp-00000024 (deleted)
          java      15358          yarn  591r      REG                8,3  70243224   25936220 /tmp/blobStore-0864a537-f6fa-4b27-9b7f-8cb5a3722c3e/incoming/temp-00000002 (deleted)
          java      15358          yarn  593r      REG                8,3  70243224   25936221 /tmp/blobStore-0864a537-f6fa-4b27-9b7f-8cb5a3722c3e/incoming/temp-00000003 (deleted)
          java      15358          yarn  594u      REG                8,3  70243224   25936268 /tmp/blobStore-0864a537-f6fa-4b27-9b7f-8cb5a3722c3e/incoming/temp-00000026 (deleted)
          java      15358          yarn  595u      REG                8,3  70243224   25936270 /tmp/blobStore-0864a537-f6fa-4b27-9b7f-8cb5a3722c3e/cache/blob_501262b25ff9158ff07ee1f4264b5e3afeaaf69f
          java      15358          yarn  597r      REG                8,3  70243224   25936255 /tmp/blobStore-0864a537-f6fa-4b27-9b7f-8cb5a3722c3e/incoming/temp-00000015 (deleted)
          java      15358          yarn  598u      REG                8,3  70243224   25936269 /tmp/blobStore-0864a537-f6fa-4b27-9b7f-8cb5a3722c3e/incoming/temp-00000027 (deleted)
          java      15358          yarn  599r      REG                8,3  70243224   25936252 /tmp/blobStore-0864a537-f6fa-4b27-9b7f-8cb5a3722c3e/incoming/temp-00000012 (deleted)
          java      15358          yarn  600r      REG                8,3  70243224   25936250 /tmp/blobStore-0864a537-f6fa-4b27-9b7f-8cb5a3722c3e/incoming/temp-00000010 (deleted)
          java      15358          yarn  601r      REG                8,3  70243224   25936254 /tmp/blobStore-0864a537-f6fa-4b27-9b7f-8cb5a3722c3e/incoming/temp-00000014 (deleted)
          java      15358          yarn  602r      REG                8,3  70243224   25936244 /tmp/blobStore-0864a537-f6fa-4b27-9b7f-8cb5a3722c3e/incoming/temp-00000005 (deleted)
          java      15358          yarn  603r      REG                8,3  70243224   25936259 /tmp/blobStore-0864a537-f6fa-4b27-9b7f-8cb5a3722c3e/incoming/temp-00000017 (deleted)
          java      15358          yarn  604r      REG                8,3  70243224   25936248 /tmp/blobStore-0864a537-f6fa-4b27-9b7f-8cb5a3722c3e/incoming/temp-00000008 (deleted)
          java      15358          yarn  605r      REG                8,3  70243224   25936260 /tmp/blobStore-0864a537-f6fa-4b27-9b7f-8cb5a3722c3e/incoming/temp-00000018 (deleted)
          java      15358          yarn  607r      REG                8,3  70243224   25936257 /tmp/blobStore-0864a537-f6fa-4b27-9b7f-8cb5a3722c3e/incoming/temp-00000019 (deleted)
          java      15358          yarn  608r      REG                8,3  70243224   25936258 /tmp/blobStore-0864a537-f6fa-4b27-9b7f-8cb5a3722c3e/incoming/temp-00000020 (deleted)
          java      15358          yarn  609r      REG                8,3  70243224   25936263 /tmp/blobStore-0864a537-f6fa-4b27-9b7f-8cb5a3722c3e/incoming/temp-00000021 (deleted)
          java      15358          yarn  610r      REG                8,3  70243224   25936264 /tmp/blobStore-0864a537-f6fa-4b27-9b7f-8cb5a3722c3e/incoming/temp-00000022 (deleted)
          java      15358          yarn  613r      REG                8,3  70243224   25936247 /tmp/blobStore-0864a537-f6fa-4b27-9b7f-8cb5a3722c3e/incoming/temp-00000007 (deleted)
          java      15358          yarn  617r      REG                8,3  70243224   25936253 /tmp/blobStore-0864a537-f6fa-4b27-9b7f-8cb5a3722c3e/incoming/temp-00000013 (deleted)
          java      15358          yarn  618r      REG                8,3  70243224   25936251 /tmp/blobStore-0864a537-f6fa-4b27-9b7f-8cb5a3722c3e/incoming/temp-00000011 (deleted)
          java      15358          yarn  619r      REG                8,3  70243224   25936249 /tmp/blobStore-0864a537-f6fa-4b27-9b7f-8cb5a3722c3e/incoming/temp-00000009 (deleted)
          java      15358          yarn  631r      REG                8,3  70243224   25936256 /tmp/blobStore-0864a537-f6fa-4b27-9b7f-8cb5a3722c3e/incoming/temp-00000016 (deleted)
          java      15454          yarn  mem       REG                8,3  70243224   25936219 /tmp/blobStore-087a0b08-ee59-4d21-8523-c78a79984a4a/cache/blob_501262b25ff9158ff07ee1f4264b5e3afeaaf69f
          java      15454          yarn  490r      REG                8,3  70243224   25936219 /tmp/blobStore-087a0b08-ee59-4d21-8523-c78a79984a4a/cache/blob_501262b25ff9158ff07ee1f4264b5e3afeaaf69f
          

          The two process ids you see here are:

          yarn     15358  4.9  0.3 1362160 431128 ?      Sl   15:24   1:52  |       \_ /usr/lib/jvm/jre/bin/java -Xmx424M -Dlog.file=/var/log/hadoop-yarn/containers/application_1464009968005_2639/container_1464009968005_2639_01_000001/jobmanager.log -Dlogback.configurationFile=file:logback.xml -Dlog4j.configuration=file:log4j.properties org.apache.flink.yarn.YarnApplicationMasterRunner
          yarn     15454 10.1  0.6 1306404 801228 ?      Sl   15:24   3:51          \_ /usr/lib/jvm/jre/bin/java -Xms424m -Xmx424m -XX:MaxDirectMemorySize=424m -Dlog.file=/var/log/hadoop-yarn/containers/application_1464009968005_2639/container_1464009968005_2639_01_000002/taskmanager.log -Dlogback.configurationFile=file:./logback.xml -Dlog4j.configuration=file:./log4j.properties org.apache.flink.yarn.YarnTaskManager --configDir .
          StephanEwen Stephan Ewen added a comment -

          Thanks for opening this and analyzing it so thoroughly.
          Do you have a patch ready? If not, we'll try to address this in the next days...

          nielsbasjes Niels Basjes added a comment -

          No, I had a quick look at the code and I could not find where the file handle was opened that should be closed along with the delete.

          mxm Maximilian Michels added a comment -

          Thanks for reporting, Niels Basjes. It could have to do with Flink keeping the user classloader after job completion, which we changed recently. The classloader probably holds a reference to all its supplied files. Did you make changes to the number of old jobs to keep? The default is 5 but it can be adjusted.

          rmetzger Robert Metzger added a comment -

          I looked a bit into the issue because I'm wondering whether we want to include the fix into the 1.1.2 release.

          So to me it seems that the BlobLibraryCacheManager is doing everything as expected.
          I wonder whether we need to close the URLClassLoader when removing the job from the JobManager.
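For context on the suggestion above: since Java 7, java.net.URLClassLoader implements Closeable, and close() releases the jar files the loader has opened. The sketch below only demonstrates that JDK call on a throwaway jar. The class name, the helper methods, and the "hello.txt" entry are hypothetical, and it makes no claim about where Flink would need to call close().

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.net.URL;
import java.net.URLClassLoader;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.jar.JarEntry;
import java.util.jar.JarOutputStream;

// Illustration only: hypothetical names, not taken from Flink.
class CloseUserClassLoaderDemo {

    /** Builds a tiny jar with one text entry; it stands in for a user job jar. */
    static Path writeJar(String entryName, String content) throws IOException {
        Path jar = Files.createTempFile("user-job-", ".jar");
        try (JarOutputStream out = new JarOutputStream(Files.newOutputStream(jar))) {
            out.putNextEntry(new JarEntry(entryName));
            out.write(content.getBytes(StandardCharsets.UTF_8));
            out.closeEntry();
        }
        return jar;
    }

    /** Reads a stream to the end and returns its bytes. */
    static byte[] readAll(InputStream in) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[4096];
        int n;
        while ((n = in.read(buffer)) != -1) {
            out.write(buffer, 0, n);
        }
        return out.toByteArray();
    }

    /** Loads a resource through a URLClassLoader, closes the loader, then deletes the jar. */
    static String loadThenClose(String content) {
        try {
            Path jar = writeJar("hello.txt", content);
            String read;
            // try-with-resources calls URLClassLoader.close(), which closes
            // the jar file handles the loader opened.
            try (URLClassLoader loader =
                    new URLClassLoader(new URL[] {jar.toUri().toURL()}, null)) {
                try (InputStream in = loader.getResourceAsStream("hello.txt")) {
                    if (in == null) {
                        throw new IllegalStateException("resource not found in jar");
                    }
                    read = new String(readAll(in), StandardCharsets.UTF_8);
                }
            }
            // With the loader closed, deleting the jar leaves no open handle behind.
            Files.delete(jar);
            return read;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(loadThenClose("payload"));
    }
}
```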

          rmetzger Robert Metzger added a comment -

          Sorry, I didn't refresh before writing a comment.

          nielsbasjes Niels Basjes added a comment -

          I didn't change that setting.
          In the web UI I see only the last 5 jobs that ran and it shows 26 jobs completed and 2 failed.

          When I look at the lsof output I see 28 open files with the status 'deleted' (see previous comment).
          The last of those jobs finished about 4 days ago (before the weekend) and the files are still open.
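Counting the leaked files by hand gets tedious, so here is a small sketch that summarizes lsof output. It assumes the whitespace-separated column layout shown in this ticket (COMMAND, PID, USER, FD, TYPE, DEVICE, SIZE, NODE, NAME, then "(deleted)"); other lsof versions or options may print different columns. The class and method names are hypothetical.

```java
import java.util.Arrays;
import java.util.List;

// Illustration only: hypothetical helper, assumes the lsof layout from this ticket.
class DeletedBlobSummary {

    /**
     * Returns {count, totalBytes} for lsof lines that name a deleted file
     * under a /tmp/blobStore directory. Lines without a "(deleted)" suffix,
     * such as the "DEL" and "mem" entries, are skipped.
     */
    static long[] summarize(List<String> lsofLines) {
        long count = 0;
        long bytes = 0;
        for (String line : lsofLines) {
            String trimmed = line.trim();
            if (!trimmed.endsWith("(deleted)") || !trimmed.contains("/tmp/blobStore")) {
                continue;
            }
            String[] cols = trimmed.split("\\s+");
            if (cols.length < 10) {
                continue; // not the expected ten-column layout
            }
            try {
                bytes += Long.parseLong(cols[6]); // SIZE column
                count++;
            } catch (NumberFormatException e) {
                // SIZE column was not numeric; ignore this line.
            }
        }
        return new long[] {count, bytes};
    }

    public static void main(String[] args) {
        List<String> sample = Arrays.asList(
            "java 15358 yarn 422r REG 8,3 70243224 25936222 /tmp/blobStore-x/incoming/temp-00000004 (deleted)",
            "java 15358 yarn DEL REG 8,3 25936269 /tmp/blobStore-x/incoming/temp-00000027");
        long[] result = summarize(sample);
        System.out.println(result[0] + " deleted file(s), " + result[1] + " bytes still held");
    }
}
```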

          mxm Maximilian Michels added a comment -

          Thanks for the update. That's very important information. Have you run a similar number of jobs with 1.0.x and experienced this problem?

          In theory, the classloader of old jobs should be discarded when the ExecutionGraph is removed from the archive (after 5 new jobs have been submitted). Perhaps we hold another reference to the Classloader in the Web UI which prevents it from getting garbage collected.
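To make the eviction idea concrete, here is a hypothetical sketch of a bounded archive that closes a job's resources explicitly when the job drops out, rather than waiting for garbage collection. BoundedJobArchive and its methods are invented for this illustration and do not reflect how Flink's archive is implemented. The limit of 5 mirrors the default mentioned in this thread.

```java
import java.io.Closeable;
import java.io.IOException;
import java.util.LinkedHashMap;
import java.util.Map;

// Illustration only: hypothetical class, not Flink's archive implementation.
class BoundedJobArchive {

    private final Map<String, Closeable> archived;
    private int closedCount = 0;

    BoundedJobArchive(final int maxJobs) {
        // Insertion-ordered map that evicts the oldest job once the limit is exceeded.
        this.archived = new LinkedHashMap<String, Closeable>() {
            @Override
            protected boolean removeEldestEntry(Map.Entry<String, Closeable> eldest) {
                if (size() > maxJobs) {
                    release(eldest.getValue());
                    return true;
                }
                return false;
            }
        };
    }

    /** Archives a finished job together with the resource (e.g. its URLClassLoader) to close later. */
    void archive(String jobId, Closeable userClassLoader) {
        archived.put(jobId, userClassLoader);
    }

    /** Closes the evicted job's resource so its file handles are released right away. */
    private void release(Closeable resource) {
        try {
            resource.close();
            closedCount++;
        } catch (IOException e) {
            // A real system would log this; the sketch just moves on.
        }
    }

    int archivedJobs() {
        return archived.size();
    }

    int closedCount() {
        return closedCount;
    }

    public static void main(String[] args) {
        BoundedJobArchive archive = new BoundedJobArchive(5);
        for (int i = 0; i < 8; i++) {
            archive.archive("job-" + i, () -> { });
        }
        System.out.println(archive.archivedJobs() + " kept, " + archive.closedCount() + " closed");
    }
}
```

An explicit close on eviction does not depend on every other reference to the classloader being gone, which is the failure mode suspected above.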

          nielsbasjes Niels Basjes added a comment -

          Today I tried to reproduce this problem with the wordcount example and there this problem does not occur.

          mxm Maximilian Michels added a comment -

          Interesting. But it occurs consistently with your other jobs?

          nielsbasjes Niels Basjes added a comment -

          I have tried to create a minimal application that reproduces the problem I see.

          1. Get flink 1.1.1 scala 2.10 binary for Linux.
          2. Manually update yarn-session.sh to the latest in master to fix the HBase classpath issue.
          3. Make sure you have HBase running and configured properly (i.e. HBASE_CONF_DIR and HADOOP_CONF_DIR are set up correctly in your environment).
          4. Create a table called test in HBase with at least 1 row in it.
          5. Start ./flink-1.1.1/bin/yarn-session.sh -n2 -s5 -d
          6. Get this test project and build it: https://github.com/nielsbasjes/Reproduce-FLINK-4485
          7. Then run this jar file with something like ./flink-1.1.1/bin/flink run target/FLINK-4485-1.0-SNAPSHOT.jar several times.
          8. Now, when you run lsof | fgrep blob on the Hadoop node running the jobmanager, you should see the deleted files as shown before.

          This reproduction path works on my machine ...
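The "(deleted)" entries that lsof reports in step 8 come from files that were unlinked while a process still held them open. A minimal Java sketch of that effect follows, assuming POSIX/Linux file semantics; the class and method names are made up for this example and have nothing to do with Flink's blob store code.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;

/** Illustration only: a deleted file stays readable (and keeps its disk space) while open. */
final class DeletedButOpenDemo {

    /**
     * Writes a temp file of the given size, opens it, deletes it, and then reads
     * it through the still-open handle. Returns the number of bytes read after deletion.
     */
    static int readAfterDelete(int size) {
        try {
            Path file = Files.createTempFile("blob-demo-", ".tmp");
            Files.write(file, new byte[size]);
            try (InputStream in = Files.newInputStream(file)) {
                // Removes the directory entry; on Linux the data stays until the handle closes.
                Files.delete(file);
                if (Files.exists(file)) {
                    throw new IllegalStateException("file should no longer be visible");
                }
                byte[] buffer = new byte[8192];
                int total = 0;
                int n;
                while ((n = in.read(buffer)) != -1) {
                    total += n;
                }
                return total;
            } // Closing the stream releases the last handle, so the space can be reclaimed.
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(readAfterDelete(4096));
    }
}
```

While the stream is open, tools such as du no longer count the file because it has no path, but the file system still accounts for its blocks, which matches the symptoms in the report.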

          mxm Maximilian Michels added a comment -

          Thanks a lot for the minimal example to reproduce the issue. We will reproduce and fix the issue as soon as possible.

          nielsbasjes Niels Basjes added a comment -

          Have you been able to reproduce the problem with the test application I wrote?

          mxm Maximilian Michels added a comment -

          Yes, I was able to reproduce the problem with your test application yesterday. Checked out the relevant code and I think I'm on to something.

          githubbot ASF GitHub Bot added a comment -

          GitHub user mxm opened a pull request:

          https://github.com/apache/flink/pull/2499

          FLINK-4485 close and remove user class loader after job completion

          Keeping the user class loader around after job completion may lead to
          excessive temp space usage because all user jars are kept until the
          class loader is garbage collected. Tests showed that garbage collection
          can be delayed for a long time after the class loader is not referenced
          anymore. Note that for the class loader to not be referenced anymore,
          its job has to be removed from the archive.

          The fastest way to minimize temp space usage is to close and remove the
          URLClassloader after job completion. This requires us to keep a
          serializable copy of all data which needs the user class loader after
          job completion, e.g. to display data on the web interface.

          You can merge this pull request into a Git repository by running:

          $ git pull https://github.com/mxm/flink FLINK-4485

          Alternatively you can review and apply these changes as the patch at:

          https://github.com/apache/flink/pull/2499.patch

          To close this pull request, make a commit to your master/trunk branch
          with (at least) the following in the commit message:

          This closes #2499


          commit 6ed17b9f5b9c13c80200ccf3db82bbfe727830bb
          Author: Maximilian Michels <mxm@apache.org>
          Date: 2016-09-15T09:00:58Z

          FLINK-4485 close and remove user class loader after job completion

          Keeping the user class loader around after job completion may lead to
          excessive temp space usage because all user jars are kept until the
          class loader is garbage collected. Tests showed that garbage collection
          can be delayed for a long time after the class loader is not referenced
          anymore. Note that for the class loader to not be referenced anymore,
          its job has to be removed from the archive.

          The fastest way to minimize temp space usage is to close and remove the
          URLClassloader after job completion. This requires us to keep a
          serializable copy of all data which needs the user class loader after
          job completion, e.g. to display data on the web interface.
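The idea in the commit message, closing the class loader explicitly instead of waiting for garbage collection, can be sketched with the JDK alone. The example below is an assumption-laden illustration, not the code from the pull request: the jar name, the `hello.txt` resource, and the class name are invented, and the jar merely stands in for an uploaded user jar. `URLClassLoader.close()` is a standard JDK method (Java 7 and later) that releases the jar files the loader has opened.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.net.URL;
import java.net.URLClassLoader;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.jar.JarEntry;
import java.util.jar.JarOutputStream;

/** Illustration only: release a jar by closing its URLClassLoader explicitly. */
final class CloseUserClassLoaderDemo {

    /** Builds a tiny jar with one text resource; it stands in for a user jar. */
    static Path writeJar(Path dir) throws IOException {
        Path jar = dir.resolve("user-code.jar");
        try (OutputStream out = Files.newOutputStream(jar);
             JarOutputStream jos = new JarOutputStream(out)) {
            jos.putNextEntry(new JarEntry("hello.txt"));
            jos.write("hello".getBytes(StandardCharsets.UTF_8));
            jos.closeEntry();
        }
        return jar;
    }

    /** Reads a resource through a URLClassLoader, closes the loader, then deletes the jar. */
    static String loadThenRelease() {
        try {
            Path dir = Files.createTempDirectory("blob-demo-");
            Path jar = writeJar(dir);
            String content;
            // try-with-resources closes the reader and stream first, then the loader itself.
            try (URLClassLoader loader =
                         new URLClassLoader(new URL[] {jar.toUri().toURL()}, null);
                 InputStream in = loader.getResourceAsStream("hello.txt");
                 BufferedReader reader = new BufferedReader(
                         new InputStreamReader(in, StandardCharsets.UTF_8))) {
                content = reader.readLine();
            }
            // The loader no longer holds the jar open, so deleting it frees the space at once.
            Files.delete(jar);
            Files.delete(dir);
            return content;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(loadThenRelease());
    }
}
```

Once a loader is closed it can no longer load new classes or resources, which is why the commit message notes that data needed after job completion has to be kept in a form that does not depend on the user class loader.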


          githubbot ASF GitHub Bot added a comment -

          Github user StephanEwen commented on the issue:

          https://github.com/apache/flink/pull/2499

          Looks good to me.

          +1 to merge

          githubbot ASF GitHub Bot added a comment -

          Github user mxm commented on the issue:

          https://github.com/apache/flink/pull/2499

          Thanks! Just a few words to @nielsbasjes who reported the issue. I've tested the fix using the test instructions you provided. Even before this fix, I could get rid of the temp files by forcing a manual garbage collection on the JVM, using `jcmd <pid> GC.run`. However, that only worked once the job metadata had been removed from the archive, i.e. once the job no longer showed up in the web interface. With this fix, the class loader is cleared upon job completion and the files are immediately removed. `lsof | fgrep blob_` didn't show any of these files anymore.

          Note that we don't perform any cleanup on the TaskManager side. There we also wind up with some leftover files, but they don't seem to pile up. It must be that the garbage collector can figure out when to clean up much earlier. Plus, we don't keep a reference to old Task instances like we do for the web interface on the JobManager side.

          @StephanEwen I'm thinking about adding a similar fix for the TaskManager side. What do you think?

          githubbot ASF GitHub Bot added a comment -

          Github user StephanEwen commented on the issue:

          https://github.com/apache/flink/pull/2499

          What would the fix for the TaskManager look like? Simply explicitly closing the UserCodeClassloader, or does it need more?

          githubbot ASF GitHub Bot added a comment -

          Github user mxm commented on the issue:

          https://github.com/apache/flink/pull/2499

          @StephanEwen Yes, it is simple. I just pushed a commit. This now releases all temp files after job completion.

          githubbot ASF GitHub Bot added a comment -

          Github user StephanEwen commented on the issue:

          https://github.com/apache/flink/pull/2499

          Looks good to me.

          +1 to merge when tests pass

          githubbot ASF GitHub Bot added a comment -

          Github user mxm commented on the issue:

          https://github.com/apache/flink/pull/2499

          This needed another fix because in some tests we use the system class loader instead of a class loader instantiated by the BlobLibraryCacheManager. If we close that one, we cause tests to fail. The solution is to close only `FlinkUserCodeClassLoader`s.
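The guard described in this comment can be sketched as a type check before closing. The code below is a hedged illustration: `UserCodeClassLoader` and `closeIfUserCode` are hypothetical stand-ins invented for this example, not Flink's `FlinkUserCodeClassLoader` or the method changed in the pull request.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.URL;
import java.net.URLClassLoader;

/** Illustration only: close per-job class loaders, leave every other loader untouched. */
final class SelectiveClose {

    /** Hypothetical per-job loader type (a stand-in, not Flink's class). */
    static final class UserCodeClassLoader extends URLClassLoader {
        private boolean closed;

        UserCodeClassLoader(URL[] urls, ClassLoader parent) {
            super(urls, parent);
        }

        @Override
        public void close() throws IOException {
            closed = true;
            super.close();
        }

        boolean isClosed() {
            return closed;
        }
    }

    /** Closes the loader only if it is a per-job loader; returns true if it was closed. */
    static boolean closeIfUserCode(ClassLoader loader) {
        if (loader instanceof UserCodeClassLoader) {
            try {
                ((UserCodeClassLoader) loader).close();
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
            return true;
        }
        // System or other shared loaders are skipped so that unrelated code keeps working.
        return false;
    }

    public static void main(String[] args) {
        UserCodeClassLoader perJob = new UserCodeClassLoader(new URL[0], null);
        System.out.println(closeIfUserCode(perJob));                            // true
        System.out.println(closeIfUserCode(ClassLoader.getSystemClassLoader())); // false
    }
}
```

Checking for the specific subclass rather than for `URLClassLoader` in general matters because, depending on the Java version, the system class loader may itself be a `URLClassLoader`.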

          githubbot ASF GitHub Bot added a comment -

          Github user mxm commented on the issue:

          https://github.com/apache/flink/pull/2499

          Merging after tests pass.

          githubbot ASF GitHub Bot added a comment -

          Github user asfgit closed the pull request at:

          https://github.com/apache/flink/pull/2499

          mxm Maximilian Michels added a comment - - edited

          master: 4a8e94403fb48318561a3cf2da57ba9da280949e
          release-1.1: 62c666f5794fa211bf570874b1b77044fd6840ac

          mxm Maximilian Michels added a comment -

          Niels Basjes, let me know if there are any more problems with the temp space.

          nielsbasjes Niels Basjes added a comment -

          I did a run with the latest 'master' and I ran over 80 jobs on a secured cluster without any problems.
          Looks fixed to me.

          mxm Maximilian Michels added a comment -

          Great to hear!


            People

            • Assignee:
              mxm Maximilian Michels
              Reporter:
              nielsbasjes Niels Basjes
            • Votes:
              0
              Watchers:
              6
