SPARK-20640: Make RPC timeout and retry for shuffle registration configurable

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 2.0.2
    • Fix Version/s: 2.3.0
    • Component/s: Shuffle
    • Labels: None

      Description

      Currently the shuffle service registration timeout and retry count are hardcoded (see https://github.com/sitalkedia/spark/blob/master/network/shuffle/src/main/java/org/apache/spark/network/shuffle/ExternalShuffleClient.java#L144 and https://github.com/sitalkedia/spark/blob/master/core/src/main/scala/org/apache/spark/storage/BlockManager.scala#L197). This works well for small workloads, but under heavy load, when the shuffle service is busy transferring large amounts of data, it responds to registration requests with significant delay. As a result, executors often fail to register with the shuffle service, which eventually fails the job. We need to make these two parameters configurable.
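To make the request concrete, here is a minimal sketch of a registration call whose timeout and attempt count are parameters rather than constants. It is not Spark's code: `register_with_retries`, `make_busy_service`, and all parameter names are hypothetical, and the defaults of 5000 ms and 3 attempts are illustrative assumptions. The simulated service delays its first few replies to mimic a shuffle service that is busy transferring data.

```python
import concurrent.futures
import time


def register_with_retries(register, timeout_ms=5000, max_attempts=3, retry_wait_s=0.0):
    """Call `register` until it answers within `timeout_ms`, at most `max_attempts` times.

    Hypothetical helper for illustration only; not Spark's implementation.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    last_error = None
    for attempt in range(1, max_attempts + 1):
        # A fresh single-thread pool per attempt, so a still-running slow call
        # from an earlier attempt cannot block the next one.
        pool = concurrent.futures.ThreadPoolExecutor(max_workers=1)
        future = pool.submit(register)
        try:
            # Wait for the reply, but no longer than the configured timeout.
            return future.result(timeout=timeout_ms / 1000.0)
        except Exception as exc:
            # Covers both a timed-out wait and an error raised by `register`.
            last_error = exc
        finally:
            pool.shutdown(wait=False)
        if attempt < max_attempts:
            time.sleep(retry_wait_s)
    raise RuntimeError(
        f"registration failed after {max_attempts} attempts"
    ) from last_error


def make_busy_service(slow_replies, delay_s):
    """Simulate a service whose first `slow_replies` answers are delayed by `delay_s`."""
    calls = {"n": 0}

    def register():
        calls["n"] += 1
        if calls["n"] <= slow_replies:
            time.sleep(delay_s)  # the service is busy with other work
        return "registered"

    return register, calls
```

With a short timeout and too few attempts, the call against a busy service fails, which is the behavior described above; raising either parameter lets registration succeed. In Spark 2.3.0 and later, the configuration documentation lists `spark.shuffle.registration.timeout` and `spark.shuffle.registration.maxAttempts` for this purpose.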

              People

              • Assignee: Li Yichao (lyc)
              • Reporter: Sital Kedia (sitalkedia@gmail.com)
              • Votes: 1
              • Watchers: 10