Details

    Description

      We're experiencing a fatal crash in TaskTest:
      https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=45440&view=logs&j=0da23115-68bb-5dcd-192c-bd4c8adebde1&t=24c3384f-1bcb-57b3-224f-51bf973bbee8&l=8334

      [...]
      Jan 31 01:03:12 [ERROR] Process Exit Code: 239
      Jan 31 01:03:12 [ERROR] Crashed tests:
      Jan 31 01:03:12 [ERROR] org.apache.flink.runtime.taskmanager.TaskTest
      Jan 31 01:03:12 [ERROR] at org.apache.maven.plugin.surefire.booterclient.ForkStarter.fork(ForkStarter.java:748)
      Jan 31 01:03:12 [ERROR] at org.apache.maven.plugin.surefire.booterclient.ForkStarter.access$700(ForkStarter.java:121)
      Jan 31 01:03:12 [ERROR] at org.apache.maven.plugin.surefire.booterclient.ForkStarter$1.call(ForkStarter.java:393)
      Jan 31 01:03:12 [ERROR] at org.apache.maven.plugin.surefire.booterclient.ForkStarter$1.call(ForkStarter.java:370)
      Jan 31 01:03:12 [ERROR] at java.util.concurrent.FutureTask.run(FutureTask.java:266)
      Jan 31 01:03:12 [ERROR] at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
      Jan 31 01:03:12 [ERROR] at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
      Jan 31 01:03:12 [ERROR] at java.lang.Thread.run(Thread.java:748)
      Jan 31 01:03:12 [ERROR] -> [Help 1]
      Jan 31 01:03:12 [ERROR] 
      Jan 31 01:03:12 [ERROR] To see the full stack trace of the errors, re-run Maven with the -e switch.
      Jan 31 01:03:12 [ERROR] Re-run Maven using the -X switch to enable full debug logging.
      Jan 31 01:03:12 [ERROR] 
      Jan 31 01:03:12 [ERROR] For more information about the errors and possible solutions, please read the following articles:
      Jan 31 01:03:12 [ERROR] [Help 1] http://cwiki.apache.org/confluence/display/MAVEN/MojoExecutionException
      Jan 31 01:03:12 [ERROR] 
      Jan 31 01:03:12 [ERROR] After correcting the problems, you can resume the build with the command
      Jan 31 01:03:12 [ERROR]   mvn <goals> -rf :flink-runtime
      

        Activity

          mapohl Matthias Pohl added a comment -

          An AssertionError is reported that should be independent of this build failure. I created FLINK-30852 to cover that issue.

          mapohl Matthias Pohl added a comment -

          TaskTest.testInterruptibleSharedLockInInvokeAndCancel caused the failure:

          00:59:02,291 [Cancellation Watchdog for Test Task (1/1)#0 (003bbd51a0b61b0ff2925c31e749f53e_00000000000000000000000000000000_0_0).] ERROR org.apache.flink.util.FatalExitExceptionHandler              [] - FATAL: Thread 'Cancellation Watchdog for Test Task (1/1)#0 (003bbd51a0b61b0ff2925c31e749f53e_00000000000000000000000000000000_0_0).' produced an uncaught exception. Stopping the process...
          org.apache.flink.util.FlinkRuntimeException: Error in Task Cancellation Watch Dog
                  at org.apache.flink.runtime.taskmanager.Task$TaskCancelerWatchDog.run(Task.java:1801) ~[classes/:?]
                  at java.lang.Thread.run(Thread.java:748) [?:1.8.0_292]
          Caused by: java.lang.RuntimeException: Unexpected FatalError notification
                  at org.apache.flink.runtime.taskmanager.TaskTest$ProhibitFatalErrorTaskManagerActions.notifyFatalError(TaskTest.java:1278) ~[test-classes/:?]
                  at org.apache.flink.runtime.taskmanager.Task$TaskCancelerWatchDog.run(Task.java:1798) ~[classes/:?]
                  ... 1 more
          

          The TaskCancelerWatchDog triggers the System.exit when the executor thread is still alive (see Task:1781).
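
          For illustration, a minimal, hedged sketch of that watchdog pattern: wait a bounded time for the task's executor thread to terminate, and escalate to a fatal-error notifier if it is still alive afterwards. The class name, the notifier interface, and the timeout handling here are simplified assumptions, not Flink's actual TaskCancelerWatchDog.

          // Sketch only: simplified stand-in for the watchdog described above.
          final class CancellationWatchdogSketch implements Runnable {

              // Assumed stand-in for the fatal-error path that ends in System.exit.
              interface FatalErrorNotifier {
                  void notifyFatalError(String message, Throwable cause);
              }

              private final Thread executorThread;
              private final long timeoutMillis;
              private final FatalErrorNotifier notifier;

              CancellationWatchdogSketch(
                      Thread executorThread, long timeoutMillis, FatalErrorNotifier notifier) {
                  this.executorThread = executorThread;
                  this.timeoutMillis = timeoutMillis;
                  this.notifier = notifier;
              }

              @Override
              public void run() {
                  final long deadline = System.nanoTime() + timeoutMillis * 1_000_000L;
                  try {
                      // Wait (bounded) for the executor thread to terminate.
                      long remainingMillis;
                      while (executorThread.isAlive()
                              && (remainingMillis = (deadline - System.nanoTime()) / 1_000_000L) > 0) {
                          executorThread.join(remainingMillis);
                      }
                  } catch (InterruptedException e) {
                      Thread.currentThread().interrupt();
                      return;
                  }
                  // Thread outlived the timeout: escalate. In Flink this path is
                  // what ultimately stops the whole process.
                  if (executorThread.isAlive()) {
                      notifier.notifyFatalError(
                              "Task thread did not terminate within " + timeoutMillis + " ms", null);
                  }
              }
          }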

          Piotr Nowicki Anton Kalashnikov, may one of you have a look at this?

          piotr.nowicki Piotr Nowicki added a comment -

          Matthias Pohl I believe I'm not the person you wanted to notify here.

          mapohl Matthias Pohl added a comment - edited

          Ah, typo! 🤦‍♂️ Sorry for the spam, Piotr Nowicki. I meant Piotr Nowojski.


          akalashnikov Anton Kalashnikov added a comment -

          I don't see a problem here. I just see that the thread takes longer than 50 ms to finish, but it isn't stuck and is making progress, which is good. Normally it should take about 1 ms, but 50 ms is also nothing extraordinary on an overloaded machine. The only suspect is the synchronization inside the finishing loop, which can delay the actual finish, but I have no evidence that it is the reason.
          I will try to check more ideas, but if nothing works I will just increase the waiting interval for this test. (I actually think that 50 ms is low anyway.)
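
          For illustration, a hypothetical, self-contained demo of that timing argument: a worker that normally finishes almost immediately can be delayed past a tight 50 ms bound on an overloaded machine while still making progress. The worker, the simulated 80 ms delay, and the 5 s fallback bound are assumptions for the sketch, not the actual test code.

          import java.util.concurrent.CountDownLatch;
          import java.util.concurrent.TimeUnit;

          public class TightTimeoutDemo {
              public static void main(String[] args) throws InterruptedException {
                  CountDownLatch done = new CountDownLatch(1);
                  Thread worker = new Thread(() -> {
                      try {
                          Thread.sleep(80); // simulated scheduling delay under load
                      } catch (InterruptedException ignored) {
                          Thread.currentThread().interrupt();
                      } finally {
                          done.countDown();
                      }
                  });
                  worker.start();

                  // A tight 50 ms bound misses the still-progressing worker...
                  System.out.println("done within 50 ms? " + done.await(50, TimeUnit.MILLISECONDS));
                  // ...while a generous bound succeeds without risking an unbounded hang.
                  System.out.println("done within 5 s?   " + done.await(5, TimeUnit.SECONDS));
              }
          }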

          mapohl Matthias Pohl added a comment -

          Thanks for sharing your thoughts. It sounds reasonable. I'm going to lower the priority of this one to Major.

          mapohl Matthias Pohl added a comment - https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=47318&view=logs&j=0e7be18f-84f2-53f0-a32d-4a5e4a174679&t=7c1d86e3-35bd-5fd5-3b7c-30c126a78702&l=8351
          mapohl Matthias Pohl added a comment - https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=47748&view=logs&j=a549b384-c55a-52c0-c451-00e0477ab6db&t=eef5922c-08d9-5ba3-7299-8393476594e7&l=8807
          mapohl Matthias Pohl added a comment - https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=47750&view=logs&j=0e7be18f-84f2-53f0-a32d-4a5e4a174679&t=7c1d86e3-35bd-5fd5-3b7c-30c126a78702&l=8385

          akalash Anton Kalashnikov added a comment -

          In conclusion, I haven't found any deadlocks or other suspicious things. I was able to reproduce it locally; it seems it indeed just runs pretty slowly on an overloaded machine. So I just increased the timeout. We will see how it goes.

          merged to master: 6e95bfaf

          Sergey Nuyanzin Sergey Nuyanzin added a comment - Merged to 1.16 as 141b47a80092134f95dfdf6b3d0e7051d4fec6bb

          People

            Assignee: akalashnikov Anton Kalashnikov
            Reporter: mapohl Matthias Pohl
