Spark / SPARK-14849

shuffle broken when accessing standalone cluster through NAT


    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Incomplete
    • Affects Version/s: 1.6.1
    • Fix Version/s: None
    • Component/s: Spark Core

      Description

      I have the following network configuration:

                   +--------------------+
                   |                    |
                   |  spark-shell       |
                   |                    |
                   +- ip: 10.110.101.2 -+
                             |
                             |
                   +- ip: 10.110.101.1 -+
                   |                    | NAT + routing
                   |  spark-master      | configured
                   |                    |
                   +- ip: 10.110.100.1 -+
                             |
                +------------------------+
                |                        |
      +- ip: 10.110.100.2 -+    +- ip: 10.110.100.3 -+
      |                    |    |                    |
      |  spark-worker 1    |    |  spark-worker 2    |
      |                    |    |                    |
      +--------------------+    +--------------------+
      

      I have NAT, DNS and routing correctly configured, so that each machine can communicate with every other machine.

      Launching spark-shell against the cluster works well. Simple map operations work too:

      scala> sc.makeRDD(1 to 5).map(_ * 5).collect
      res0: Array[Int] = Array(5, 10, 15, 20, 25)
      

      But operations requiring shuffling fail:

      scala> sc.makeRDD(1 to 5).map(i => (i,1)).reduceByKey(_ + _).collect
      
      16/04/22 15:33:17 WARN TaskSetManager: Lost task 4.0 in stage 2.0 (TID 19, 10.110.101.1): FetchFailed(BlockManagerId(0, 10.110.101.1, 42842), shuffleId=0, mapId=6, reduceId=4, message=
      org.apache.spark.shuffle.FetchFailedException: Failed to connect to /10.110.101.1:42842
      	at org.apache.spark.storage.ShuffleBlockFetcherIterator.throwFetchFailedException(ShuffleBlockFetcherIterator.scala:323)
      [ ... ]
      	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
      	at java.lang.Thread.run(Thread.java:745)
      Caused by: java.io.IOException: Failed to connect to /10.110.101.1:42842
      	at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:216)
      [ ... ]
      	at org.apache.spark.network.shuffle.RetryingBlockFetcher.access
      
      [ ... ]
      

      It makes sense that a connection to 10.110.101.1:42842 would fail: no part of the system should have direct knowledge of the IP address 10.110.101.1, which is only the NAT's outer interface.
      So some part of the system is wrongly discovering and advertising this IP address.
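
      A plausible mechanism is that each executor's BlockManager registers whatever address its socket happens to bind to, and behind NAT that address is not the one peers can reach. As a workaround sketch (not a confirmed fix for this issue): Spark 1.6 lets you pin the addresses and ports its daemons bind to and advertise, via `SPARK_LOCAL_IP` in `conf/spark-env.sh` and the `spark.driver.host` / `spark.blockManager.port` properties. The concrete addresses below are illustrative, taken from the diagram above.

```shell
# Sketch only: pin the addresses Spark advertises instead of letting
# each daemon guess its own. IPs below mirror the diagram in this report.

# On each worker (conf/spark-env.sh): bind to the worker's own address
# on the internal 10.110.100.x network.
export SPARK_LOCAL_IP=10.110.100.2

# On the driver machine: tell executors how to reach the driver, and
# fix the block manager port so the NAT can forward it deterministically.
spark-shell --master spark://10.110.100.1:7077 \
  --conf spark.driver.host=10.110.101.2 \
  --conf spark.blockManager.port=42842
```

      Fixing `spark.blockManager.port` at least makes the shuffle port predictable for NAT port-forwarding rules; it does not by itself change which IP address the BlockManager registers.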

              People

              • Assignee: Unassigned
              • Reporter: Luc Bourlier (skyluc)
              • Votes: 1
              • Watchers: 5
