Details
- Type: Bug
- Status: Resolved
- Priority: Major
- Resolution: Incomplete
- Affects Version/s: 1.6.1
- Fix Version/s: None
Description
I have the following network configuration:
          +--------------------+
          |                    |
          |     spark-shell    |
          |                    |
          +- ip: 10.110.101.2 -+
                    |
          +- ip: 10.110.101.1 -+
          |                    |
          |    spark-master    |  NAT + routing
          |                    |  configured
          +- ip: 10.110.100.1 -+
                    |
         +----------+----------+
         |                     |
+- ip: 10.110.100.2 -+  +- ip: 10.110.100.3 -+
|                    |  |                    |
|   spark-worker 1   |  |   spark-worker 2   |
|                    |  |                    |
+--------------------+  +--------------------+
I have NAT, DNS and routing correctly configured, so that every machine can communicate with every other. Launching spark-shell against the cluster works well, and simple map operations work too:
scala> sc.makeRDD(1 to 5).map(_ * 5).collect
res0: Array[Int] = Array(5, 10, 15, 20, 25)
But operations requiring shuffling fail:
scala> sc.makeRDD(1 to 5).map(i => (i,1)).reduceByKey(_ + _).collect
16/04/22 15:33:17 WARN TaskSetManager: Lost task 4.0 in stage 2.0 (TID 19, 10.110.101.1): FetchFailed(BlockManagerId(0, 10.110.101.1, 42842), shuffleId=0, mapId=6, reduceId=4, message=
org.apache.spark.shuffle.FetchFailedException: Failed to connect to /10.110.101.1:42842
	at org.apache.spark.storage.ShuffleBlockFetcherIterator.throwFetchFailedException(ShuffleBlockFetcherIterator.scala:323)
	[ ... ]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
	at java.lang.Thread.run(Thread.java:745)
Caused by: java.io.IOException: Failed to connect to /10.110.101.1:42842
	at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:216)
	[ ... ]
	at org.apache.spark.network.shuffle.RetryingBlockFetcher.access
	[ ... ]
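The same failure can be reproduced outside Spark with a plain TCP connect to the advertised block-manager endpoint. The sketch below (the IP and port are taken from the log above; whether the connect succeeds obviously depends on where you run it) shows that this is an ordinary reachability problem, not something specific to the shuffle code:

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;

public class ProbeAdvertisedAddress {
    public static void main(String[] args) {
        // The address the executor advertised for its block manager,
        // copied from the FetchFailedException in the log.
        InetSocketAddress target = new InetSocketAddress("10.110.101.1", 42842);
        try (Socket socket = new Socket()) {
            // Short timeout: an address that is not routable from this host
            // times out or is refused, exactly like the shuffle fetch does.
            socket.connect(target, 500);
            System.out.println("connect succeeded");
        } catch (IOException e) {
            System.out.println("connect failed: " + e.getClass().getSimpleName());
        }
    }
}
```

Run from a worker, this probe fails for the same reason the shuffle fetch fails: the advertised address belongs to a network the worker cannot route to.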
It makes sense that a connection to 10.110.101.1:42842 would fail: no part of the system should have direct knowledge of the IP address 10.110.101.1. So some part of the system is wrongly discovering this IP address.
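As an illustration of how a JVM process can pick up an address from the "wrong" interface on a multi-homed or NATed machine, compare what the JVM resolves the local hostname to against the full set of configured interface addresses. This is only a sketch of the general mechanism, not Spark's actual discovery code (which lives in org.apache.spark.util.Utils):

```java
import java.net.InetAddress;
import java.net.NetworkInterface;
import java.util.Collections;

public class ShowLocalAddresses {
    public static void main(String[] args) throws Exception {
        // What the JVM resolves the local hostname to. On a multi-homed
        // machine this may be any one of several interfaces, depending on
        // DNS and /etc/hosts -- and it may not be the one peers can reach.
        InetAddress local = InetAddress.getLocalHost();
        System.out.println("getLocalHost: " + local.getHostAddress());

        // All addresses actually configured on this machine.
        for (NetworkInterface nic : Collections.list(NetworkInterface.getNetworkInterfaces())) {
            for (InetAddress addr : Collections.list(nic.getInetAddresses())) {
                System.out.println(nic.getName() + ": " + addr.getHostAddress());
            }
        }
    }
}
```

When the automatically discovered address is wrong for the topology, Spark 1.6 allows overriding it explicitly, e.g. via the SPARK_LOCAL_IP environment variable or the spark.driver.host property.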
Issue Links
- is related to: SPARK-14437 "Spark using Netty RPC gets wrong address in some setups" (Resolved)