Spark / SPARK-26907

Does ShuffledRDD Replication Work With External Shuffle Service


    Details

    • Type: Question
    • Status: Resolved
    • Priority: Major
    • Resolution: Invalid
    • Affects Version/s: 2.3.2
    • Fix Version/s: None
    • Component/s: Block Manager, Spark Core, YARN
    • Labels: None

      Description

      I am interested in working with high-replication environments for extreme fault tolerance (e.g., 10x replication), but I have noticed that when using groupBy or groupWith followed by persist (with 10x replication), the entire stage can fail with a FetchFailedException even if only one node fails.


      Is this because the External Shuffle Service writes and serves intermediate shuffle data only to/from the local disk attached to the executor that generated it, causing Spark to ignore replicated shuffle data (from the persist) that could be served elsewhere? If so, is there any way to increase the replication factor of the External Shuffle Service to make it fault tolerant?
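
      For reference, the setup described above can be requested through a custom StorageLevel with a replication count. This is a minimal sketch (the app name, data, and grouping key are illustrative); note that the replication applies only to the persisted RDD blocks managed by the BlockManager, not to the shuffle files themselves:

      ```scala
      import org.apache.spark.sql.SparkSession
      import org.apache.spark.storage.StorageLevel

      object ReplicatedGroupBy {
        def main(args: Array[String]): Unit = {
          val spark = SparkSession.builder.appName("replicated-groupBy").getOrCreate()
          val sc = spark.sparkContext

          // Custom StorageLevel: disk + memory, on-heap, serialized, replication = 10.
          // This replicates the *persisted* blocks via the BlockManager; the shuffle
          // files produced by groupBy are still written only to the local disks of
          // the mapper executors and served from there by the shuffle service.
          val replicated10 = StorageLevel(useDisk = true, useMemory = true,
            useOffHeap = false, deserialized = false, replication = 10)

          val grouped = sc.parallelize(1 to 1000)
            .groupBy(_ % 10)       // shuffle: map outputs land on local disk only
            .persist(replicated10) // post-shuffle blocks are replicated 10x

          grouped.count() // materializes the persisted, replicated blocks
          spark.stop()
        }
      }
      ```

      If a mapper node is lost before the persisted blocks are fully materialized, reducers fetching its shuffle output will still hit a FetchFailedException, since those map outputs exist only on the failed node's disk.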


            People

            • Assignee: Unassigned
            • Reporter: Han Altae-Tran (altaeth)
            • Votes: 0
            • Watchers: 2
