Uploaded image for project: 'Flink'
  1. Flink
  2. FLINK-8834

Job fails to restart due to some tasks stuck in cancelling state

    XMLWordPrintableJSON

Details

    • Bug
    • Status: Closed
    • Major
    • Resolution: Fixed
    • 1.4.0
    • 1.5.0
    • None
    • None
    • AWS EMR 5.12

      Flink 1.4.0

      Beam 2.3.0

    Description

      Our job threw an exception overnight, causing the job to commence attempting a restart.

      However it never managed to restart because 2 tasks on one of the Task Managers are stuck in "Cancelling" state, with the following exception

      2018-03-02 02:29:31,604 WARN  org.apache.flink.runtime.taskmanager.Task                     - Task 'PTransformTranslation.UnknownRawPTransform -> ParDoTranslation.RawParDo -> ParDoTranslation.RawParDo -> uk.co.bbc.sawmill.streaming.pipeline.output.io.file.WriteWindowToFile-RDotRecord/TextIO.Write/WriteFiles/GatherTempFileResults/Reshuffle/Window.Into()/Window.Assign.out -> ParDoTranslation.RawParDo -> ToKeyedWorkItem (24/32)' did not react to cancelling signal, but is stuck in method:
       java.lang.Thread.blockedOn(Thread.java:239)
      java.lang.System$2.blockedOn(System.java:1252)
      java.nio.channels.spi.AbstractInterruptibleChannel.blockedOn(AbstractInterruptibleChannel.java:211)
      java.nio.channels.spi.AbstractInterruptibleChannel.begin(AbstractInterruptibleChannel.java:170)
      java.nio.channels.Channels$WritableByteChannelImpl.write(Channels.java:457)
      java.nio.channels.Channels.writeFullyImpl(Channels.java:78)
      java.nio.channels.Channels.writeFully(Channels.java:101)
      java.nio.channels.Channels.access$000(Channels.java:61)
      java.nio.channels.Channels$1.write(Channels.java:174)
      java.util.zip.DeflaterOutputStream.deflate(DeflaterOutputStream.java:253)
      java.util.zip.DeflaterOutputStream.write(DeflaterOutputStream.java:211)
      java.util.zip.GZIPOutputStream.write(GZIPOutputStream.java:145)
      java.nio.channels.Channels$WritableByteChannelImpl.write(Channels.java:458)
      java.nio.channels.Channels.writeFullyImpl(Channels.java:78)
      java.nio.channels.Channels.writeFully(Channels.java:101)
      java.nio.channels.Channels.access$000(Channels.java:61)
      java.nio.channels.Channels$1.write(Channels.java:174)
      sun.nio.cs.StreamEncoder.writeBytes(StreamEncoder.java:221)
      sun.nio.cs.StreamEncoder.implWrite(StreamEncoder.java:282)
      sun.nio.cs.StreamEncoder.write(StreamEncoder.java:125)
      sun.nio.cs.StreamEncoder.write(StreamEncoder.java:135)
      java.io.OutputStreamWriter.write(OutputStreamWriter.java:220)
      java.io.Writer.write(Writer.java:157)
      org.apache.beam.sdk.io.TextSink$TextWriter.writeLine(TextSink.java:102)
      org.apache.beam.sdk.io.TextSink$TextWriter.write(TextSink.java:118)
      org.apache.beam.sdk.io.TextSink$TextWriter.write(TextSink.java:76)
      org.apache.beam.sdk.io.WriteFiles.writeOrClose(WriteFiles.java:550)
      org.apache.beam.sdk.io.WriteFiles.access$1000(WriteFiles.java:112)
      org.apache.beam.sdk.io.WriteFiles$WriteShardsIntoTempFilesFn.processElement(WriteFiles.java:718)
      org.apache.beam.sdk.io.WriteFiles$WriteShardsIntoTempFilesFn$DoFnInvoker.invokeProcessElement(Unknown Source)
      org.apache.beam.runners.core.SimpleDoFnRunner.invokeProcessElement(SimpleDoFnRunner.java:177)
      org.apache.beam.runners.core.SimpleDoFnRunner.processElement(SimpleDoFnRunner.java:138)
      org.apache.beam.runners.flink.metrics.DoFnRunnerWithMetricsUpdate.processElement(DoFnRunnerWithMetricsUpdate.java:65)
      org.apache.beam.runners.flink.translation.wrappers.streaming.DoFnOperator.processElement(DoFnOperator.java:425)
      org.apache.flink.streaming.runtime.tasks.OperatorChain$CopyingChainingOutput.pushToOperator(OperatorChain.java:549)
      org.apache.flink.streaming.runtime.tasks.OperatorChain$CopyingChainingOutput.collect(OperatorChain.java:524)
      org.apache.flink.streaming.runtime.tasks.OperatorChain$CopyingChainingOutput.collect(OperatorChain.java:504)
      org.apache.flink.streaming.api.operators.AbstractStreamOperator$CountingOutput.collect(AbstractStreamOperator.java:831)
      org.apache.flink.streaming.api.operators.AbstractStreamOperator$CountingOutput.collect(AbstractStreamOperator.java:809)
      org.apache.beam.runners.flink.translation.wrappers.streaming.DoFnOperator$BufferedOutputManager.emit(DoFnOperator.java:888)
      org.apache.beam.runners.flink.translation.wrappers.streaming.DoFnOperator$BufferedOutputManager.output(DoFnOperator.java:865)
      org.apache.beam.runners.core.GroupAlsoByWindowViaWindowSetNewDoFn$1.outputWindowedValue(GroupAlsoByWindowViaWindowSetNewDoFn.java:94)
      org.apache.beam.runners.core.GroupAlsoByWindowViaWindowSetNewDoFn$1.outputWindowedValue(GroupAlsoByWindowViaWindowSetNewDoFn.java:87)
      org.apache.beam.runners.core.ReduceFnRunner.lambda$onTrigger$1(ReduceFnRunner.java:1040)
      org.apache.beam.runners.core.ReduceFnRunner$$Lambda$163/1408647946.output(Unknown Source)
      org.apache.beam.runners.core.ReduceFnContextFactory$OnTriggerContextImpl.output(ReduceFnContextFactory.java:433)
      org.apache.beam.runners.core.SystemReduceFn.onTrigger(SystemReduceFn.java:127)
      org.apache.beam.runners.core.ReduceFnRunner.onTrigger(ReduceFnRunner.java:1043)
      org.apache.beam.runners.core.ReduceFnRunner.emit(ReduceFnRunner.java:911)
      org.apache.beam.runners.core.ReduceFnRunner.onTimers(ReduceFnRunner.java:776)
      org.apache.beam.runners.core.GroupAlsoByWindowViaWindowSetNewDoFn.processElement(GroupAlsoByWindowViaWindowSetNewDoFn.java:134)
      org.apache.beam.runners.core.GroupAlsoByWindowViaWindowSetNewDoFn$DoFnInvoker.invokeProcessElement(Unknown Source)
      org.apache.beam.runners.core.SimpleDoFnRunner.invokeProcessElement(SimpleDoFnRunner.java:177)
      org.apache.beam.runners.core.SimpleDoFnRunner.processElement(SimpleDoFnRunner.java:141)
      org.apache.beam.runners.core.LateDataDroppingDoFnRunner.processElement(LateDataDroppingDoFnRunner.java:74)
      org.apache.beam.runners.flink.metrics.DoFnRunnerWithMetricsUpdate.processElement(DoFnRunnerWithMetricsUpdate.java:65)
      org.apache.beam.runners.flink.translation.wrappers.streaming.WindowDoFnOperator.fireTimer(WindowDoFnOperator.java:108)
      org.apache.beam.runners.flink.translation.wrappers.streaming.DoFnOperator.onEventTime(DoFnOperator.java:767)
      org.apache.flink.streaming.api.operators.HeapInternalTimerService.advanceWatermark(HeapInternalTimerService.java:277)
      org.apache.beam.runners.flink.translation.wrappers.streaming.DoFnOperator.processWatermark1(DoFnOperator.java:532)
      org.apache.beam.runners.flink.translation.wrappers.streaming.DoFnOperator.processWatermark(DoFnOperator.java:501)
      org.apache.flink.streaming.runtime.io.StreamInputProcessor$ForwardingValveOutputHandler.handleWatermark(StreamInputProcessor.java:288)
      org.apache.flink.streaming.runtime.streamstatus.StatusWatermarkValve.findAndOutputNewMinWatermarkAcrossAlignedChannels(StatusWatermarkValve.java:189)
      org.apache.flink.streaming.runtime.streamstatus.StatusWatermarkValve.inputWatermark(StatusWatermarkValve.java:111)
      org.apache.flink.streaming.runtime.io.StreamInputProcessor.processInput(StreamInputProcessor.java:189)
      org.apache.flink.streaming.runtime.tasks.OneInputStreamTask.run(OneInputStreamTask.java:69)
      org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:264)
      org.apache.flink.runtime.taskmanager.Task.run(Task.java:718)
      java.lang.Thread.run(Thread.java:748)
      
      2018-03-02 02:29:32,332 WARN  org.apache.flink.runtime.taskmanager.Task                     - Task 'PTransformTranslation.UnknownRawPTransform -> ParDoTranslation.RawParDo -> ParDoTranslation.RawParDo -> uk.co.bbc.sawmill.streaming.pipeline.output.io.file.WriteWindowToFile-RDotRecord2/TextIO.Write/WriteFiles/GatherTempFileResults/Reshuffle/Window.Into()/Window.Assign.out -> ParDoTranslation.RawParDo -> ToKeyedWorkItem (22/32)' did not react to cancelling signal, but is stuck in method:
       java.lang.Thread.blockedOn(Thread.java:239)
      java.lang.System$2.blockedOn(System.java:1252)
      java.nio.channels.spi.AbstractInterruptibleChannel.blockedOn(AbstractInterruptibleChannel.java:211)
      java.nio.channels.spi.AbstractInterruptibleChannel.begin(AbstractInterruptibleChannel.java:170)
      java.nio.channels.Channels$WritableByteChannelImpl.write(Channels.java:457)
      java.nio.channels.Channels.writeFullyImpl(Channels.java:78)
      java.nio.channels.Channels.writeFully(Channels.java:101)
      java.nio.channels.Channels.access$000(Channels.java:61)
      java.nio.channels.Channels$1.write(Channels.java:174)
      java.util.zip.DeflaterOutputStream.deflate(DeflaterOutputStream.java:253)
      java.util.zip.DeflaterOutputStream.write(DeflaterOutputStream.java:211)
      java.util.zip.GZIPOutputStream.write(GZIPOutputStream.java:145)
      java.nio.channels.Channels$WritableByteChannelImpl.write(Channels.java:458)
      java.nio.channels.Channels.writeFullyImpl(Channels.java:78)
      java.nio.channels.Channels.writeFully(Channels.java:101)
      java.nio.channels.Channels.access$000(Channels.java:61)
      java.nio.channels.Channels$1.write(Channels.java:174)
      sun.nio.cs.StreamEncoder.writeBytes(StreamEncoder.java:221)
      sun.nio.cs.StreamEncoder.implWrite(StreamEncoder.java:282)
      sun.nio.cs.StreamEncoder.write(StreamEncoder.java:125)
      sun.nio.cs.StreamEncoder.write(StreamEncoder.java:135)
      java.io.OutputStreamWriter.write(OutputStreamWriter.java:220)
      java.io.Writer.write(Writer.java:157)
      org.apache.beam.sdk.io.TextSink$TextWriter.writeLine(TextSink.java:102)
      org.apache.beam.sdk.io.TextSink$TextWriter.write(TextSink.java:118)
      org.apache.beam.sdk.io.TextSink$TextWriter.write(TextSink.java:76)
      org.apache.beam.sdk.io.WriteFiles.writeOrClose(WriteFiles.java:550)
      org.apache.beam.sdk.io.WriteFiles.access$1000(WriteFiles.java:112)
      org.apache.beam.sdk.io.WriteFiles$WriteShardsIntoTempFilesFn.processElement(WriteFiles.java:718)
      org.apache.beam.sdk.io.WriteFiles$WriteShardsIntoTempFilesFn$DoFnInvoker.invokeProcessElement(Unknown Source)
      org.apache.beam.runners.core.SimpleDoFnRunner.invokeProcessElement(SimpleDoFnRunner.java:177)
      org.apache.beam.runners.core.SimpleDoFnRunner.processElement(SimpleDoFnRunner.java:138)
      org.apache.beam.runners.flink.metrics.DoFnRunnerWithMetricsUpdate.processElement(DoFnRunnerWithMetricsUpdate.java:65)
      org.apache.beam.runners.flink.translation.wrappers.streaming.DoFnOperator.processElement(DoFnOperator.java:425)
      org.apache.flink.streaming.runtime.tasks.OperatorChain$CopyingChainingOutput.pushToOperator(OperatorChain.java:549)
      org.apache.flink.streaming.runtime.tasks.OperatorChain$CopyingChainingOutput.collect(OperatorChain.java:524)
      org.apache.flink.streaming.runtime.tasks.OperatorChain$CopyingChainingOutput.collect(OperatorChain.java:504)
      org.apache.flink.streaming.api.operators.AbstractStreamOperator$CountingOutput.collect(AbstractStreamOperator.java:831)
      org.apache.flink.streaming.api.operators.AbstractStreamOperator$CountingOutput.collect(AbstractStreamOperator.java:809)
      org.apache.beam.runners.flink.translation.wrappers.streaming.DoFnOperator$BufferedOutputManager.emit(DoFnOperator.java:888)
      org.apache.beam.runners.flink.translation.wrappers.streaming.DoFnOperator$BufferedOutputManager.output(DoFnOperator.java:865)
      org.apache.beam.runners.core.GroupAlsoByWindowViaWindowSetNewDoFn$1.outputWindowedValue(GroupAlsoByWindowViaWindowSetNewDoFn.java:94)
      org.apache.beam.runners.core.GroupAlsoByWindowViaWindowSetNewDoFn$1.outputWindowedValue(GroupAlsoByWindowViaWindowSetNewDoFn.java:87)
      org.apache.beam.runners.core.ReduceFnRunner.lambda$onTrigger$1(ReduceFnRunner.java:1040)
      org.apache.beam.runners.core.ReduceFnRunner$$Lambda$163/1408647946.output(Unknown Source)
      org.apache.beam.runners.core.ReduceFnContextFactory$OnTriggerContextImpl.output(ReduceFnContextFactory.java:433)
      org.apache.beam.runners.core.SystemReduceFn.onTrigger(SystemReduceFn.java:127)
      org.apache.beam.runners.core.ReduceFnRunner.onTrigger(ReduceFnRunner.java:1043)
      org.apache.beam.runners.core.ReduceFnRunner.emit(ReduceFnRunner.java:911)
      org.apache.beam.runners.core.ReduceFnRunner.onTimers(ReduceFnRunner.java:776)
      org.apache.beam.runners.core.GroupAlsoByWindowViaWindowSetNewDoFn.processElement(GroupAlsoByWindowViaWindowSetNewDoFn.java:134)
      org.apache.beam.runners.core.GroupAlsoByWindowViaWindowSetNewDoFn$DoFnInvoker.invokeProcessElement(Unknown Source)
      org.apache.beam.runners.core.SimpleDoFnRunner.invokeProcessElement(SimpleDoFnRunner.java:177)
      org.apache.beam.runners.core.SimpleDoFnRunner.processElement(SimpleDoFnRunner.java:141)
      org.apache.beam.runners.core.LateDataDroppingDoFnRunner.processElement(LateDataDroppingDoFnRunner.java:74)
      org.apache.beam.runners.flink.metrics.DoFnRunnerWithMetricsUpdate.processElement(DoFnRunnerWithMetricsUpdate.java:65)
      org.apache.beam.runners.flink.translation.wrappers.streaming.WindowDoFnOperator.fireTimer(WindowDoFnOperator.java:108)
      org.apache.beam.runners.flink.translation.wrappers.streaming.DoFnOperator.onEventTime(DoFnOperator.java:767)
      org.apache.flink.streaming.api.operators.HeapInternalTimerService.advanceWatermark(HeapInternalTimerService.java:277)
      org.apache.beam.runners.flink.translation.wrappers.streaming.DoFnOperator.processWatermark1(DoFnOperator.java:532)
      org.apache.beam.runners.flink.translation.wrappers.streaming.DoFnOperator.processWatermark(DoFnOperator.java:501)
      org.apache.flink.streaming.runtime.io.StreamInputProcessor$ForwardingValveOutputHandler.handleWatermark(StreamInputProcessor.java:288)
      org.apache.flink.streaming.runtime.streamstatus.StatusWatermarkValve.findAndOutputNewMinWatermarkAcrossAlignedChannels(StatusWatermarkValve.java:189)
      org.apache.flink.streaming.runtime.streamstatus.StatusWatermarkValve.inputWatermark(StatusWatermarkValve.java:111)
      org.apache.flink.streaming.runtime.io.StreamInputProcessor.processInput(StreamInputProcessor.java:189)
      org.apache.flink.streaming.runtime.tasks.OneInputStreamTask.run(OneInputStreamTask.java:69)
      org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:264)
      org.apache.flink.runtime.taskmanager.Task.run(Task.java:718)
      java.lang.Thread.run(Thread.java:748)
      

      I can see a bit further up in the logs the following exceptions too (although not sure if they are related) - this exception looks similar to FLINK-8751

      2018-03-02 02:29:07,094 ERROR org.apache.flink.streaming.runtime.tasks.StreamTask           - Could not shut down timer service
      java.lang.InterruptedException
              at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2067)
              at java.util.concurrent.ThreadPoolExecutor.awaitTermination(ThreadPoolExecutor.java:1475)
              at org.apache.flink.streaming.runtime.tasks.SystemProcessingTimeService.shutdownAndAwaitPending(SystemProcessingTimeService.java:197)
              at org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:317)
              at org.apache.flink.runtime.taskmanager.Task.run(Task.java:718)

       

       

      Attachments

        Issue Links

          Activity

            People

              Unassigned Unassigned
              djharper Daniel Harper
              Votes:
              0 Vote for this issue
              Watchers:
              6 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved: