Spark / SPARK-9328

Netty IO layer should implement read timeouts

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Blocker
    • Resolution: Fixed
    • Affects Version/s: 1.2.1, 1.3.1
    • Fix Version/s: 1.4.0
    • Component/s: Shuffle, Spark Core
    • Labels: None

      Description

      Spark's network layer does not implement read timeouts, which can lead to stalls during shuffle: if a remote shuffle server hangs while responding to a shuffle block fetch request but never closes the socket, the job may block until an OS-level socket timeout occurs.

      I think that we can fix this using Netty's ReadTimeoutHandler (http://stackoverflow.com/questions/13390363/netty-connecttimeoutmillis-vs-readtimeouthandler). The tricky parts will be figuring out the right place in the pipeline to add the handler and ensuring that we don't introduce performance problems by defeating socket re-use.
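
      A minimal sketch of how ReadTimeoutHandler could be wired into a Netty pipeline. The class name, handler name, and 120-second timeout are illustrative assumptions, not Spark's actual wiring:

```java
import io.netty.channel.ChannelInitializer;
import io.netty.channel.socket.SocketChannel;
import io.netty.handler.timeout.ReadTimeoutHandler;

import java.util.concurrent.TimeUnit;

// Hypothetical initializer for illustration; Spark configures its real
// client pipeline elsewhere in the network module.
public class ShuffleChannelInitializer extends ChannelInitializer<SocketChannel> {
  // Assumed timeout value, for illustration only.
  private static final long READ_TIMEOUT_SECONDS = 120;

  @Override
  protected void initChannel(SocketChannel ch) {
    // ReadTimeoutHandler fires a ReadTimeoutException when no inbound data
    // has been read for the configured duration; placing it first lets it
    // observe every read event before the decoders run.
    ch.pipeline().addLast("readTimeout",
        new ReadTimeoutHandler(READ_TIMEOUT_SECONDS, TimeUnit.SECONDS));
    // ... the existing frame decoder and message handlers would follow here
  }
}
```

      When the timeout fires, the exception propagates to the pipeline's exceptionCaught handlers, which can close the channel and fail the outstanding fetches.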

      Quoting from that linked StackOverflow question:

      Note that the ReadTimeoutHandler is also unaware of whether you have sent a request - it only cares whether data has been read from the socket. If your connection is persistent, and you only want read timeouts to fire when a request has been sent, you'll need to build a request / response aware timeout handler.

      If we want to avoid tearing down connections between shuffles then we may have to do something like this.
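
      The request/response-aware behavior described above can be sketched as a small, Netty-independent tracker (all names here are hypothetical): the timeout is armed only while a fetch request is outstanding, so an idle persistent connection is never torn down.

```java
// Hypothetical request/response-aware timeout check, not Spark's actual code.
// A periodic task would call isTimedOut() and close the connection on true.
public class RequestAwareTimeout {
  private final long timeoutMillis;
  private int outstandingRequests = 0;
  private long lastActivityMillis = 0;

  public RequestAwareTimeout(long timeoutMillis) {
    this.timeoutMillis = timeoutMillis;
  }

  // Record that a fetch request went out; this arms the timeout.
  public synchronized void requestSent(long nowMillis) {
    outstandingRequests++;
    lastActivityMillis = nowMillis;
  }

  // Any response counts as activity and, once all outstanding requests
  // are answered, disarms the timeout again.
  public synchronized void responseReceived(long nowMillis) {
    if (outstandingRequests > 0) {
      outstandingRequests--;
    }
    lastActivityMillis = nowMillis;
  }

  // True only when a request is outstanding AND the remote side has been
  // silent longer than the timeout; an idle connection never trips this.
  public synchronized boolean isTimedOut(long nowMillis) {
    return outstandingRequests > 0
        && nowMillis - lastActivityMillis >= timeoutMillis;
  }
}
```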

            People

            • Assignee: joshrosen (Josh Rosen)
            • Reporter: joshrosen (Josh Rosen)
            • Votes: 0
            • Watchers: 3
