Flink / FLINK-29618

YARNSessionFIFOSecuredITCase.testDetachedMode timed out in Azure CI


Details

    Description

      We experienced a build failure that was caused solely by YARNSessionFIFOSecuredITCase.testDetachedMode running into a timeout.

      The test-specific logs extracted from the build are attached to this Jira issue.

      JUnit tries to stop the thread running the test but fails to do so because the interrupt hits a Thread.sleep call. The resulting InterruptedException is not properly handled in YarnTestBase:744: it is caught and logged but not forwarded. Therefore, all we see is the following warning, logged after 60s:

      11:33:51,124 [ForkJoinPool-1-worker-25] WARN  org.apache.flink.yarn.YarnTestBase                           [] - Interruped
      java.lang.InterruptedException: sleep interrupted
              at java.lang.Thread.sleep(Native Method) ~[?:1.8.0_292]
              at org.apache.flink.yarn.YarnTestBase.sleep(YarnTestBase.java:716) ~[test-classes/:?]
              at org.apache.flink.yarn.YarnTestBase.startWithArgs(YarnTestBase.java:906) ~[test-classes/:?]
              at org.apache.flink.yarn.YARNSessionFIFOITCase.runDetachedModeTest(YARNSessionFIFOITCase.java:141) ~[test-classes/:?]
              at org.apache.flink.yarn.YARNSessionFIFOSecuredITCase.lambda$testDetachedMode$2(YARNSessionFIFOSecuredITCase.java:173) ~[test-classes/:?]
              at org.apache.flink.yarn.YarnTestBase.runTest(YarnTestBase.java:288) ~[test-classes/:?]
              at org.apache.flink.yarn.YARNSessionFIFOSecuredITCase.testDetachedMode(YARNSessionFIFOSecuredITCase.java:160) ~[test-classes/:?]
              at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) ~[?:1.8.0_292]
              at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) ~[?:1.8.0_292]
              at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) ~[?:1.8.0_292]
              at java.lang.reflect.Method.invoke(Method.java:498) ~[?:1.8.0_292]
      [...]
      

      The test code itself eventually continues and succeeds (despite the interruption). The job submission takes suspiciously long, though.
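      Why the test keeps running can be illustrated in isolation. The actual YarnTestBase code is not reproduced in this issue, so the following is only a minimal, self-contained Java sketch under stated assumptions: the helper names (sleepSwallowing, sleepPropagating, bodyCompletesAfterInterrupt) are hypothetical, and a plain Thread.interrupt() stands in for the interrupt JUnit sends when the timeout fires. It contrasts a sleep helper that logs and swallows InterruptedException with one that propagates it.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.atomic.AtomicBoolean;

public class InterruptDemo {

    /** Hypothetical helper mirroring the problematic pattern: log the interrupt and carry on. */
    static void sleepSwallowing(long millis) {
        try {
            Thread.sleep(millis);
        } catch (InterruptedException e) {
            // Thread.sleep clears the interrupt status when it throws; since the flag is
            // not restored and the exception is not rethrown, the caller never learns of it.
            System.err.println("WARN Interrupted: " + e.getMessage());
        }
    }

    /** Hypothetical fixed helper: let the interrupt propagate to the caller. */
    static void sleepPropagating(long millis) throws InterruptedException {
        Thread.sleep(millis);
    }

    /**
     * Runs a stand-in "test body" on a worker thread, interrupts it (simulating a
     * timeout), and reports whether the body still ran to completion.
     */
    static boolean bodyCompletesAfterInterrupt(boolean swallow) {
        AtomicBoolean completed = new AtomicBoolean(false);
        CountDownLatch started = new CountDownLatch(1);

        Thread worker = new Thread(() -> {
            started.countDown();
            try {
                if (swallow) {
                    sleepSwallowing(10_000);
                } else {
                    sleepPropagating(10_000);
                }
                // Reached only if the interrupt did not stop the body.
                completed.set(true);
            } catch (InterruptedException e) {
                // Restore the interrupt status and stop the body here.
                Thread.currentThread().interrupt();
            }
        });

        worker.start();
        try {
            started.await();
            // If the worker is not yet inside Thread.sleep, the pending interrupt
            // makes sleep throw immediately on entry, so timing does not matter.
            worker.interrupt();
            worker.join(5_000);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("demo driver was interrupted", e);
        }
        return completed.get();
    }

    public static void main(String[] args) {
        System.out.println("swallowing helper, body completed: " + bodyCompletesAfterInterrupt(true));
        System.out.println("propagating helper, body completed: " + bodyCompletesAfterInterrupt(false));
    }
}
```

      Running main prints true for the swallowing helper and false for the propagating one: once the exception is swallowed, nothing tells the test body to stop, which matches the observation that the test continues and succeeds despite the interruption. Where a helper cannot declare throws InterruptedException, calling Thread.currentThread().interrupt() in the catch block is the usual alternative to rethrowing.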

      Removing the timeout from the test (which is now the preferred approach for tests in general) should resolve this instability.


            People

              Wencong Liu
              Matthias Pohl (mapohl)
              Votes: 0
              Watchers: 2
