Spark / SPARK-28239

Allow TCP connections created by the shuffle service to auto-close on YARN NodeManagers

    Details

    • Type: Improvement
    • Status: In Progress
    • Priority: Minor
    • Resolution: Unresolved
    • Affects Version/s: 3.0.0
    • Fix Version/s: None
    • Component/s: Shuffle, YARN
    • Labels:
      None
    • Environment:

      Hadoop 2.6.0-CDH5.8.3 (netty3)
      Spark 2.4.0 (netty4)

      Configs:
      spark.shuffle.service.enabled=true

      Description

      When shuffle tasks run, the shuffle service establishes TCP connections (on port 7337 by default).
      They look like this (see the attached screenshots):

      However, some of these TCP connections remain busy after the task has actually finished. They are not closed automatically and persist until the NodeManager process is restarted.

      The connections pile up, and the NodeManagers get progressively slower.
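
      One way to watch the pile-up described above is to count the connections on the shuffle service port by TCP state. The following is only a minimal sketch, not part of the reported change: it assumes a Linux NodeManager host where `ss -tan` is available and prints the columns State, Recv-Q, Send-Q, Local Address:Port, Peer Address:Port. The names `count_states` and `read_ss` are hypothetical helpers introduced for illustration.

```python
import subprocess
from collections import Counter

SHUFFLE_PORT = 7337  # default shuffle service port, as stated in the description


def count_states(ss_output, port=SHUFFLE_PORT):
    """Count TCP connections whose local port equals `port`, grouped by state.

    Assumes `ss -tan` style lines: State Recv-Q Send-Q Local:Port Peer:Port
    """
    counts = Counter()
    for line in ss_output.splitlines():
        fields = line.split()
        # Skip blank or short lines and the header row.
        if len(fields) < 5 or fields[0] == "State":
            continue
        # rsplit keeps bracketed IPv6 addresses such as [::1]:7337 intact.
        local_port = fields[3].rsplit(":", 1)[-1]
        if local_port == str(port):
            counts[fields[0]] += 1
    return counts


def read_ss():
    """Return the raw output of `ss -tan` (assumes the command exists on the host)."""
    result = subprocess.run(
        ["ss", "-tan"], capture_output=True, text=True, check=True
    )
    return result.stdout
```

      Calling `count_states(read_ss())` periodically on a NodeManager host and comparing the totals after jobs have finished would show whether connections on the port keep accumulating.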

      These unclosed TCP connections stay busy, and setting ChannelOption.SO_KEEPALIVE to true, as in SPARK-23182, does not seem to have any effect.

      The solution is to set ChannelOption.AUTO_CLOSE to true. Since making that change, our cluster (running 10000+ jobs / day) has been processing normally.

        Attachments

        1. screenshot-2.png
          46 kB
          Deegue
        2. screenshot-1.png
          175 kB
          Deegue

              People

              • Assignee:
                Unassigned
              • Reporter:
                Deegue
              • Votes:
                0
              • Watchers:
                2