Spark / SPARK-28239

Allow TCP connections created by the shuffle service to auto-close on YARN NodeManagers

Details

    • Type: Improvement
    • Status: In Progress
    • Priority: Minor
    • Resolution: Unresolved
    • Affects Version/s: 3.1.0
    • Fix Version/s: None
    • Component/s: Shuffle, Spark Core, YARN
    • Labels: None
    • Environment:
      Hadoop 2.6.0-CDH5.8.3 (netty3)
      Spark 2.4.0 (netty4)

      Configs:
      spark.shuffle.service.enabled=true

    Description

      When shuffle tasks run, the shuffle service establishes TCP connections (on port 7337 by default).
      They look like this (see the attached screenshots):

      However, some of these TCP connections remain busy after the task has actually finished. They are not closed automatically and persist until we restart the NodeManager process.

      As connections pile up, the NodeManagers get slower and slower.

      These unclosed TCP connections stay busy, and setting ChannelOption.SO_KEEPALIVE to true, as done in SPARK-23182, does not seem to have any effect.

      The solution is to set ChannelOption.AUTO_CLOSE to true. Since making this change, our cluster (running 10000+ jobs / day) has been processing normally.
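      For context on the SO_KEEPALIVE option mentioned above: keepalive only lets the operating system probe an idle connection to detect a peer that is no longer reachable; it does not by itself close a connection whose peer is still alive. The sketch below is a JDK-only illustration of enabling that socket option on a plain connection. It is not Spark's or Netty's code: the class and method names are hypothetical, and it binds an ephemeral loopback port rather than the shuffle service port.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.InetAddress;
import java.net.InetSocketAddress;
import java.net.StandardSocketOptions;
import java.nio.channels.ServerSocketChannel;
import java.nio.channels.SocketChannel;

// Hypothetical demo class; not part of Spark or Netty.
public class KeepAliveSketch {

    // Opens a loopback connection with SO_KEEPALIVE enabled on the client
    // socket and returns the value the socket reports back.
    static boolean connectWithKeepAlive() {
        try (ServerSocketChannel server = ServerSocketChannel.open()) {
            // Port 0 asks the OS for any free port (assumption for the demo;
            // the shuffle service listens on 7337 by default).
            server.bind(new InetSocketAddress(InetAddress.getLoopbackAddress(), 0));
            try (SocketChannel client = SocketChannel.open()) {
                // Enable TCP keepalive before connecting.
                client.setOption(StandardSocketOptions.SO_KEEPALIVE, true);
                client.connect(server.getLocalAddress());
                // Accept the connection so both ends exist, then close it
                // explicitly via try-with-resources.
                try (SocketChannel accepted = server.accept()) {
                    return client.getOption(StandardSocketOptions.SO_KEEPALIVE);
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("SO_KEEPALIVE enabled: " + connectWithKeepAlive());
    }
}
```

      Note that every channel in this sketch is closed explicitly by try-with-resources; the option alone never closes anything, which is consistent with the observation above that enabling it did not reduce the number of lingering connections.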

      Attachments

        1. screenshot-2.png
          46 kB
          Deegue
        2. screenshot-1.png
          175 kB
          Deegue

            People

              Assignee: Unassigned
              Reporter: Deegue
              Votes: 0
              Watchers: 2
