Flink / FLINK-8707

Excessive amount of files opened by flink task manager

Details

    • Type: Bug
    • Status: Closed
    • Priority: Critical
    • Resolution: Not A Problem
    • Affects Version/s: 1.3.2
    • Fix Version/s: None
    • Component/s: Runtime / Coordination
    • Labels: None

    Description

      The job manager has fewer FDs than the task manager.

       

      Hi

      A support alert indicated that there were a lot of open files for the boxes running Flink.

      There were 4 Flink jobs that were dormant but had consumed a number of messages from Kafka using the FlinkKafkaConsumer010.

      A simple general lsof:

      $ lsof | wc -l       ->  returned 153114 open file descriptors.

      Focusing on the TaskManager process (process ID = 12154):

      $ lsof | grep 12154 | wc -l   -> returned 129322 open FDs

      $ lsof -p 12154 | wc -l   -> returned 531 FDs

      There were 228 threads running for the task manager.
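A possible explanation for the gap between the 129322 and 531 figures above, offered as an assumption rather than something this report confirms: on Linux, a bare `lsof` can list open files once per thread, so each descriptor of the process is repeated for every thread, and `grep 12154` also matches that number anywhere in a row. With 228 threads, 531 × 228 = 121068, which is in the same range as 129322. The Python sketch below (the function name `unique_fds` is hypothetical) matches on the PID column only and collapses per-thread repeats. The second thread ID, 6634, is made up for illustration.

```python
from typing import Iterable, Set


def unique_fds(lsof_lines: Iterable[str], pid: int) -> Set[str]:
    """Return the distinct FD column values that lsof reported for `pid`."""
    seen: Set[str] = set()
    for line in lsof_lines:
        fields = line.split()
        # Match on the PID column only, not on the number anywhere in the row.
        if len(fields) < 5 or fields[1] != str(pid):
            continue
        # Assumption: a numeric third column is a thread ID (TID), as in the
        # sample rows of this report; otherwise the row has no TID column.
        fd = fields[4] if fields[2].isdigit() else fields[3]
        seen.add(fd)
    return seen


# Rows for thread 6633 are taken from this report; the rows for thread 6634
# are hypothetical and only show how per-thread repeats inflate a raw count.
SAMPLE = [
    "java 12154 6633 dfdev 387u a_inode 0,9 0 5869 [eventpoll]",
    "java 12154 6633 dfdev 388r FIFO 0,8 0t0 459758080 pipe",
    "java 12154 6633 dfdev 389w FIFO 0,8 0t0 459758080 pipe",
    "java 12154 6634 dfdev 387u a_inode 0,9 0 5869 [eventpoll]",
    "java 12154 6634 dfdev 388r FIFO 0,8 0t0 459758080 pipe",
    "java 12154 6634 dfdev 389w FIFO 0,8 0t0 459758080 pipe",
]

print(len(SAMPLE), len(unique_fds(SAMPLE, 12154)))  # prints "6 3"
```

Feeding the real output of `lsof` into `unique_fds` should give a number much closer to the `lsof -p` figure if per-thread repetition is indeed the cause.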

       

      Drilling down a bit further, looking at a_inode and FIFO entries: 

      $ lsof -p 12154 | grep a_inode | wc -l   -> returned 100 FDs

      $ lsof -p 12154 | grep FIFO | wc -l   -> returned 200 FDs

      /proc/12154/maps   -> 920 entries

      Apart from lsof identifying lots of JARs and shared objects (SOs) being referenced, there were also 244 child processes for the task manager process.

      I noticed a creep in file descriptors in each environment. Are the above figures deemed excessive for the number of FDs in use? I know Flink uses Netty; is it using a separate Selector for reads and writes?

      Additionally, does Flink use memory-mapped files or direct ByteBuffers, and are these skewing the number of FDs shown?
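On the question of skewed numbers: the FD column of `lsof` output also carries entries that are not numbered descriptors, such as `mem` (memory-mapped file), `txt` (program text), `cwd` and `rtd`. Counting every row of `lsof -p` therefore overstates the descriptors actually held. The Python sketch below separates the two kinds. The function name and the sample values are hypothetical, chosen only to illustrate the split.

```python
import re
from typing import Iterable, List, Tuple

# A real descriptor starts with its number, e.g. "387u", "388r", "389w".
NUMERIC_FD = re.compile(r"^\d+")


def split_fd_column(fd_values: Iterable[str]) -> Tuple[List[str], List[str]]:
    """Split lsof FD column values into numbered descriptors and other rows."""
    descriptors: List[str] = []
    others: List[str] = []
    for value in fd_values:
        if NUMERIC_FD.match(value):
            descriptors.append(value)
        else:
            others.append(value)  # e.g. cwd, rtd, txt, mem
    return descriptors, others


# Illustrative FD column values, not taken from the attached lsof dumps.
example = ["cwd", "rtd", "txt", "mem", "mem", "387u", "388r", "389w"]
descriptors, others = split_fd_column(example)
print(len(descriptors), len(others))  # prints "3 5"
```

Only the first list corresponds to entries that count against the per-process open-file limit.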

      Example of one child process ID 6633:

      java 12154 6633 dfdev 387u a_inode 0,9 0 5869 [eventpoll]
      java 12154 6633 dfdev 388r FIFO 0,8 0t0 459758080 pipe
      java 12154 6633 dfdev 389w FIFO 0,8 0t0 459758080 pipe
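The three rows above look like what a single Java NIO `Selector` tends to hold on Linux with older JDKs: one epoll descriptor (`a_inode`, `[eventpoll]`) plus the read and write ends of a wakeup pipe (two `FIFO` rows sharing node 459758080). That reading is an assumption, not something confirmed in this report. Under it, the 100 `a_inode` and 200 `FIFO` counts would correspond to roughly 100 selectors. A minimal Python sketch of the tally, with hypothetical function names:

```python
from collections import Counter
from typing import Iterable


def type_counts(lsof_lines: Iterable[str]) -> Counter:
    """Count lsof rows by their TYPE column (a_inode, FIFO, REG, ...)."""
    counts: Counter = Counter()
    for line in lsof_lines:
        fields = line.split()
        if len(fields) < 6:
            continue
        # Assumption: a numeric third column is a thread ID (TID), which
        # shifts the TYPE column one position to the right.
        type_index = 5 if fields[2].isdigit() else 4
        counts[fields[type_index]] += 1
    return counts


def estimate_selectors(a_inode: int, fifo: int) -> int:
    """Assume each selector holds one epoll descriptor and one pipe pair."""
    return min(a_inode, fifo // 2)


ROWS = [
    "java 12154 6633 dfdev 387u a_inode 0,9 0 5869 [eventpoll]",
    "java 12154 6633 dfdev 388r FIFO 0,8 0t0 459758080 pipe",
    "java 12154 6633 dfdev 389w FIFO 0,8 0t0 459758080 pipe",
]

counts = type_counts(ROWS)
print(estimate_selectors(counts["a_inode"], counts["FIFO"]))  # prints "1"
print(estimate_selectors(100, 200))  # prints "100"
```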

      Lastly, I cannot yet identify the reason for the creep in FDs, even though Flink is pretty dormant or has dormant jobs. Production nodes are not experiencing excessive throughput yet either.
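To tell a real creep from a one-off snapshot, the descriptor count can be sampled over time. On Linux, `/proc/<pid>/fd` holds one entry per open descriptor of a process. The Python sketch below is a minimal illustration under that assumption. The function names, the `proc_root` parameter, and the sample series after the first 531 reading are hypothetical.

```python
import os
from typing import List, Sequence


def count_fds(pid: int, proc_root: str = "/proc") -> int:
    """Count entries in <proc_root>/<pid>/fd, one per open descriptor."""
    return len(os.listdir(os.path.join(proc_root, str(pid), "fd")))


def fd_deltas(samples: Sequence[int]) -> List[int]:
    """Return the change between consecutive FD count samples."""
    return [later - earlier for earlier, later in zip(samples, samples[1:])]


# Hypothetical series of readings taken at a fixed interval; only the first
# value (531) comes from this report.
readings = [531, 540, 552]
print(fd_deltas(readings))  # prints "[9, 12]"
```

Calling `count_fds(12154)` on a schedule and logging the result would show whether the count keeps rising while the jobs stay dormant. Reading another user's `/proc/<pid>/fd` usually needs matching permissions.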


      Attachments

        1. lsofp.txt (40 kB, Alon Galant)
        2. lsofp.txt (40 kB, Alon Galant)
        3. lsof.txt (5.09 MB, Alon Galant)
        4. lsof.txt (5.09 MB, Alon Galant)
        5. ll.txt (5 kB, Alon Galant)
        6. ll.txt (5 kB, Alon Galant)
        7. box2-taskmgr-lsof (1.72 MB, Alexander Gardner)
        8. box2-jobmgr-lsof (2.54 MB, Alexander Gardner)
        9. box1-taskmgr-lsof (1.59 MB, Alexander Gardner)
        10. box1-jobmgr-lsof (1.71 MB, Alexander Gardner)
        11. AterRunning-3-jobs-Box1-TM-JCONSOLE.png (122 kB, Alexander Gardner)
        12. AfterRunning-3-jobs-TM-FDs-BOX2.jpg (449 kB, Alexander Gardner)
        13. AfterRunning-3-jobs-lsof-p.box2-TM (124 kB, Alexander Gardner)
        14. AfterRunning-3-jobs-lsof.box2-TM (28.66 MB, Alexander Gardner)
        15. AfterRunning-3-jobs-Box2-TM-JCONSOLE.png (103 kB, Alexander Gardner)

          People

            Assignee: Unassigned
            Reporter: Alexander Gardner (imogard)
            Votes: 0
            Watchers: 12
