Description
It seems that there is some sort of race condition when evaluator is completed....The evaluator says that it is done and exits while driver gets the same message as if RM has taken the container away.
To be more concrete, end of evaluator log and relevant driver log are shown below. One can clearly see that shut down appeared to be clean but driver got failed evaluator message.The driver log shows that stream on which it was reading message from evaluator was closed probably because evaluator closed it. A simple fix would be that if driver expects this evaluator to shutdown and it gets this message then it assumes that shutdown is complete.
End of Evaluator Log
INFO: Handle Evaluator control message
Org.Apache.REEF.Common.Runtime.Evaluator.EvaluatorRuntime Information: 0 : 2016-03-28T14:46:59.4207980-07:00 0085
INFO: Send task control message to ContextManager
Org.Apache.REEF.Common.Runtime.Evaluator.Context.ContextManager Information: 0 : 2016-03-28T14:46:59.4207980-07:00 0085
INFO: RemoveContext with id DataLoading-container_e12_1458013582010_0048_01_000750
Org.Apache.REEF.Common.Runtime.Evaluator.Context.ContextRuntime Warning: 0 : 2016-03-28T14:46:59.4207980-07:00 0085
WARNING: Shutting down an task because the underlying context is being closed.
Org.Apache.REEF.Common.Runtime.Evaluator.Task.TaskRuntime Information: 0 : 2016-03-28T14:46:59.4207980-07:00 0085
INFO: Trying to close Task IMRUMap-Id297-Version0
Org.Apache.REEF.Common.Runtime.Evaluator.Task.TaskRuntime Warning: 0 : 2016-03-28T14:46:59.4207980-07:00 0085
WARNING: Trying to close an task that is in Done state. Ignored.
Org.Apache.REEF.Common.Context.Defaults.DefaultContextStopHandler Information: 0 : 2016-03-28T14:46:59.4207980-07:00 0085
INFO: DefaultContextStopHandler received for context: DataLoading-container_e12_1458013582010_0048_01_000750
Org.Apache.REEF.Common.Runtime.Evaluator.EvaluatorRuntime Information: 0 : 2016-03-28T14:46:59.4207980-07:00 0085
INFO: Context stack is empty, done
Org.Apache.REEF.Common.Runtime.Evaluator.EvaluatorRuntime Information: 0 : 2016-03-28T14:46:59.4207980-07:00 0085
INFO: Evaluator state : DONE
Org.Apache.REEF.Common.Runtime.Evaluator.HeartBeatManager Information: 0 : 2016-03-28T14:46:59.4207980-07:00 0085
INFO: Triggered a heartbeat: EvaluatorHeartbeatProto: task_id=[], task_status=[], task_message=[], evaluator_status=[DONE], context_status=[], timestamp=[1459201619420], recoveryFlag =[False].
Org.Apache.REEF.Common.Runtime.Evaluator.EvaluatorRuntime Information: 0 : 2016-03-28T14:46:59.4207980-07:00 0001
INFO: Runtime stop
Org.Apache.REEF.Common.Runtime.Evaluator.EvaluatorRuntime Information: 0 : 2016-03-28T14:46:59.4364238-07:00 0001
INFO: EvaluatorRuntime shutdown complete
Org.Apache.REEF.Evaluator.Evaluator Information: 0 : 2016-03-28T14:46:59.4364238-07:00 0001
INFO: Evaluator is returned from Run()
End of LogType:evaluator.stdout
Relevant driver log
INFO: Received a JobStatus message that can't be sent:
identifier: "BroadcastReduce"
state: RUNNING
message: "Evaluator container_e12_1458013582010_0048_01_000750 failed with message: Evaluator [container_e12_1458013582010_0048_01_000750] is assumed to be in state [RUNNING]. But the resource manager reports it to be in state [DONE]. This means that the Evaluator failed but wasn\'t able to send an error message back to the driver."
Mar 28, 2016 2:47:00 PM org.apache.reef.wake.remote.transport.netty.NettyChannelHandler exceptionCaught
INFO: Unexpected exception from downstream. channel: [id: 0x51bdfbfb, /10.200.144.38:64266 => /10.200.144.168:8959] local: /10.200.144.168:8959 remote: /10.200.144.38:64266
Mar 28, 2016 2:47:00 PM org.apache.reef.wake.remote.transport.netty.AbstractNettyEventListener exceptionCaught
WARNING: ExceptionEvent: local: /10.200.144.168:8959 remote: /10.200.144.247:53406 :: java.io.IOException: An existing connection was forcibly closed by the remote host
Mar 28, 2016 2:47:00 PM org.apache.reef.javabridge.generic.JobDriver$CompletedEvaluatorHandler onNext
INFO: Handling the event of completed evaluator in CLR bridge.
Mar 28, 2016 2:47:00 PM org.apache.reef.wake.remote.transport.netty.NettyChannelHandler exceptionCaught
INFO: Unexpected exception from downstream. channel: [id: 0xdc25e349, /10.200.145.206:64394 => /10.200.144.168:8959] local: /10.200.144.168:8959 remote: /10.200.145.206:64394
Mar 28, 2016 2:47:00 PM org.apache.reef.wake.remote.transport.netty.NettyChannelHandler exceptionCaught
WARNING: Unexpected exception from downstream.
java.io.IOException: An existing connection was forcibly closed by the remote host
at sun.nio.ch.SocketDispatcher.read0(Native Method)
at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:43)
at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223)
at sun.nio.ch.IOUtil.read(IOUtil.java:192)
at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:380)
at io.netty.buffer.UnpooledUnsafeDirectByteBuf.setBytes(UnpooledUnsafeDirectByteBuf.java:446)
at io.netty.buffer.AbstractByteBuf.writeBytes(AbstractByteBuf.java:879)
at io.netty.channel.socket.nio.NioSocketChannel.doReadBytes(NioSocketChannel.java:225)
at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:114)
at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:511)
at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468)
at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382)
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354)
at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:116)
at java.lang.Thread.run(Thread.java:745)
Attachments
Issue Links
- Dependent
-
ZOOKEEPER-704 GSoC 2010: Read-Only Mode
- Open
- is related to
-
REEF-1310 The Java Driver should ACK the Java Evaluator's DONE heartbeat
- Resolved
- is superceded by
-
REEF-976 Fix broken C# Tests caused by race condition of local RM
- Resolved