
SPARK-17933: Shuffle fails when the driver is on the same machine as an executor


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Incomplete
    • Affects Version/s: 1.6.2
    • Fix Version/s: None
    • Component/s: Shuffle, Spark Core

    Description

      Problem

      When I run a job that requires a shuffle, some tasks fail because one executor cannot fetch the shuffle blocks from another executor.

      org.apache.spark.shuffle.FetchFailedException: Failed to connect to 10-250-20-140:44042
      	at org.apache.spark.storage.ShuffleBlockFetcherIterator.throwFetchFailedException(ShuffleBlockFetcherIterator.scala:323)
      	at org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:300)
      	at org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:51)
      	at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
      	at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
      	at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
      	at org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32)
      	at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
      	at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
      	at org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.processInputs(TungstenAggregationIterator.scala:504)
      	at org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.<init>(TungstenAggregationIterator.scala:686)
      	at org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1$$anonfun$2.apply(TungstenAggregate.scala:95)
      	at org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1$$anonfun$2.apply(TungstenAggregate.scala:86)
      	at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$20.apply(RDD.scala:710)
      	at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$20.apply(RDD.scala:710)
      	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
      	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
      	at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
      	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
      	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
      	at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
      	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
      	at org.apache.spark.scheduler.Task.run(Task.scala:89)
      	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:227)
      	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
      	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
      	at java.lang.Thread.run(Thread.java:745)
      Caused by: java.io.IOException: Failed to connect to 10-250-20-140:44042
      	at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:216)
      	at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:167)
      	at org.apache.spark.network.netty.NettyBlockTransferService$$anon$1.createAndStart(NettyBlockTransferService.scala:90)
      	at org.apache.spark.network.shuffle.RetryingBlockFetcher.fetchAllOutstanding(RetryingBlockFetcher.java:140)
      	at org.apache.spark.network.shuffle.RetryingBlockFetcher.access$200(RetryingBlockFetcher.java:43)
      	at org.apache.spark.network.shuffle.RetryingBlockFetcher$1.run(RetryingBlockFetcher.java:170)
      	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
      	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
      	... 3 more
      Caused by: java.nio.channels.UnresolvedAddressException
      	at sun.nio.ch.Net.checkAddress(Net.java:101)
      	at sun.nio.ch.SocketChannelImpl.connect(SocketChannelImpl.java:622)
      	at io.netty.channel.socket.nio.NioSocketChannel.doConnect(NioSocketChannel.java:209)
      	at io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.connect(AbstractNioChannel.java:207)
      	at io.netty.channel.DefaultChannelPipeline$HeadContext.connect(DefaultChannelPipeline.java:1097)
      	at io.netty.channel.AbstractChannelHandlerContext.invokeConnect(AbstractChannelHandlerContext.java:471)
      	at io.netty.channel.AbstractChannelHandlerContext.connect(AbstractChannelHandlerContext.java:456)
      	at io.netty.channel.ChannelOutboundHandlerAdapter.connect(ChannelOutboundHandlerAdapter.java:47)
      	at io.netty.channel.AbstractChannelHandlerContext.invokeConnect(AbstractChannelHandlerContext.java:471)
      	at io.netty.channel.AbstractChannelHandlerContext.connect(AbstractChannelHandlerContext.java:456)
      	at io.netty.channel.ChannelDuplexHandler.connect(ChannelDuplexHandler.java:50)
      	at io.netty.channel.AbstractChannelHandlerContext.invokeConnect(AbstractChannelHandlerContext.java:471)
      	at io.netty.channel.AbstractChannelHandlerContext.connect(AbstractChannelHandlerContext.java:456)
      	at io.netty.channel.AbstractChannelHandlerContext.connect(AbstractChannelHandlerContext.java:438)
      	at io.netty.channel.DefaultChannelPipeline.connect(DefaultChannelPipeline.java:908)
      	at io.netty.channel.AbstractChannel.connect(AbstractChannel.java:203)
      	at io.netty.bootstrap.Bootstrap$2.run(Bootstrap.java:166)
      	at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:357)
      	at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:357)
      	at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:111)
      	... 1 more
      

      When you look closely, you notice that it is not trying to connect via a valid IP address but via a hostname (the IP address with the dots replaced by dashes, e.g. 10-250-20-140). Unfortunately, that hostname does not resolve from any machine other than the one it belongs to, so the other executors cannot talk to this executor.
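
      To make the failure mode concrete, here is a minimal, self-contained Scala sketch (not part of the job; the hostname, port and IP are taken or inferred from the stack trace above, so treat them as example values). It shows that a socket address built from the dashed hostname stays unresolved on any other machine, which is exactly the condition under which Netty's SocketChannel.connect throws UnresolvedAddressException:

      import java.net.InetSocketAddress

      // Minimal check: the dashed hostname only resolves on the machine it
      // belongs to, so on every other executor the address stays unresolved
      // and the subsequent connect attempt fails.
      object ResolveCheck {
        def main(args: Array[String]): Unit = {
          val byHostname = new InetSocketAddress("10-250-20-140", 44042) // address as advertised
          val byIp       = new InetSocketAddress("10.250.20.140", 44042) // assumed matching IP
          println(s"hostname resolvable: ${!byHostname.isUnresolved}")   // false on other hosts
          println(s"ip resolvable:       ${!byIp.isUnresolved}")         // true (numeric literal)
        }
      }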

      On the executor page (see the two attached screenshots) you can see what is happening (the screenshots show different IP addresses, but the same behaviour).

      Why is the executor advertised using the hostname in this particular case? Is it a bug or expected behaviour? This only happens when the executor is on the same host as the driver.
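
      For what it is worth, a possible workaround (an assumption on my side, not a confirmed fix for this ticket) is to pin the address that Spark advertises on the affected node, for example by exporting SPARK_LOCAL_IP in conf/spark-env.sh there, or by setting spark.driver.host to a routable IP for the driver. A rough Scala sketch of the latter (the IP is an example value):

      import org.apache.spark.{SparkConf, SparkContext}

      // Workaround sketch: advertise a routable IP for the driver instead of
      // the machine's dashed hostname; the executor on the same node would
      // additionally need SPARK_LOCAL_IP exported in its environment.
      object RoutableAddressExample {
        def main(args: Array[String]): Unit = {
          val conf = new SparkConf()
            .setAppName("shuffle-hostname-workaround")
            .set("spark.driver.host", "10.250.20.140") // example routable IP
          val sc = new SparkContext(conf)
          try {
            // Any shuffle-inducing job exercises the failing fetch path.
            println(sc.parallelize(1 to 1000).map(x => (x % 10, x)).reduceByKey(_ + _).count())
          } finally {
            sc.stop()
          }
        }
      }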

      Attachments

        1. screenshot-2.png (47 kB, attached by Frank Rosner)
        2. screenshot-1.png (48 kB, attached by Frank Rosner)


          People

            Assignee: Unassigned
            Reporter: Frank Rosner (frosner)
