      Description

      Occasionally, JobManagerITCase seems to hang, e.g. see https://api.travis-ci.org/jobs/220888193/log.txt?deansi=true

      The Maven watchdog kills the build because no output is produced within 300s, and JobManagerITCase seems to hang at line 772, i.e.

      JobManagerITCase lines 770-772
      // Trigger savepoint for non-existing job
      jobManager.tell(TriggerSavepoint(jobId, Option.apply("any")), testActor)
      val response = expectMsgType[TriggerSavepointFailure](deadline.timeLeft)
      

      Although the (downloaded) logs do not quite allow a precise mapping to this test case, it looks as if the following block may be related:

      09:34:47,684 INFO  org.apache.flink.runtime.minicluster.FlinkMiniCluster         - Akka ask timeout set to 100s
      09:34:47,777 INFO  org.apache.flink.runtime.minicluster.FlinkMiniCluster         - Disabled queryable state server
      09:34:47,777 INFO  org.apache.flink.runtime.minicluster.FlinkMiniCluster         - Starting FlinkMiniCluster.
      09:34:47,809 INFO  akka.event.slf4j.Slf4jLogger                                  - Slf4jLogger started
      09:34:47,837 INFO  org.apache.flink.runtime.blob.BlobServer                      - Created BLOB server storage directory /tmp/blobStore-eab23d04-ea18-4dc5-b1df-fcf9fc295062
      09:34:47,838 WARN  org.apache.flink.runtime.net.SSLUtils                         - Not a SSL socket, will skip setting tls version and cipher suites.
      09:34:47,839 INFO  org.apache.flink.runtime.blob.BlobServer                      - Started BLOB server at 0.0.0.0:36745 - max concurrent requests: 50 - max backlog: 1000
      09:34:47,840 INFO  org.apache.flink.runtime.metrics.MetricRegistry               - No metrics reporter configured, no metrics will be exposed/reported.
      09:34:47,850 INFO  org.apache.flink.runtime.testingUtils.TestingMemoryArchivist  - Started memory archivist akka://flink/user/archive_1
      09:34:47,860 INFO  org.apache.flink.runtime.testutils.TestingResourceManager     - Trying to associate with JobManager leader akka://flink/user/jobmanager_1
      09:34:47,861 INFO  org.apache.flink.runtime.testingUtils.TestingJobManager       - Starting JobManager at akka://flink/user/jobmanager_1.
      09:34:47,862 WARN  org.apache.flink.runtime.testingUtils.TestingJobManager       - Discard message LeaderSessionMessage(00000000-0000-0000-0000-000000000000,TriggerSavepoint(6e813070338a23b0ff571646bca56521,Some(any))) because there is currently no valid leader id known.
      09:34:47,862 INFO  org.apache.flink.runtime.testingUtils.TestingJobManager       - JobManager akka://flink/user/jobmanager_1 was granted leadership with leader session ID Some(00000000-0000-0000-0000-000000000000).
      09:34:47,867 INFO  org.apache.flink.runtime.testutils.TestingResourceManager     - Resource Manager associating with leading JobManager Actor[akka://flink/user/jobmanager_1#-652927556] - leader session 00000000-0000-0000-0000-000000000000
      

      If so, then this may be related to FLINK-6287 and may possibly even be a duplicate.

      What is strange, though, is that the timeout for the expected message is no more than 2m, so the test should fail on its own well within 300s.

          Activity

          NicoK Nico Kruber added a comment -

          Same here (with only the transfer.sh upload changed compared to master)
          https://s3.amazonaws.com/archive.travis-ci.org/jobs/220888197/log.txt

          StephanEwen Stephan Ewen added a comment -

          Another failed instance: https://s3.amazonaws.com/archive.travis-ci.org/jobs/223677152/log.txt

          StephanEwen Stephan Ewen added a comment -

          Hitting this frequently on local builds as well:

          Tests run: 21, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 1,220.166 sec <<< FAILURE! - in org.apache.flink.runtime.jobmanager.JobManagerITCase
          The JobManager actor must handle trigger savepoint response for non-existing job(org.apache.flink.runtime.jobmanager.JobManagerITCase)  Time elapsed: 1,199.316 sec  <<< FAILURE!
          java.lang.AssertionError: assertion failed: timeout (1199213200030 nanoseconds) during expectMsgClass waiting for class org.apache.flink.runtime.messages.JobManagerMessages$TriggerSavepointFailure
          	at scala.Predef$.assert(Predef.scala:179)
          	at akka.testkit.TestKitBase$class.expectMsgClass_internal(TestKit.scala:423)
          	at akka.testkit.TestKitBase$class.expectMsgType(TestKit.scala:405)
          	at akka.testkit.TestKit.expectMsgType(TestKit.scala:718)
          	at org.apache.flink.runtime.jobmanager.JobManagerITCase$$anonfun$1$$anonfun$apply$mcV$sp$34$$anonfun$apply$mcV$sp$35.apply$mcV$sp(JobManagerITCase.scala:772)
          	at org.apache.flink.runtime.jobmanager.JobManagerITCase$$anonfun$1$$anonfun$apply$mcV$sp$34$$anonfun$apply$mcV$sp$35.apply(JobManagerITCase.scala:764)
          	at org.apache.flink.runtime.jobmanager.JobManagerITCase$$anonfun$1$$anonfun$apply$mcV$sp$34$$anonfun$apply$mcV$sp$35.apply(JobManagerITCase.scala:764)
          	at akka.testkit.TestKitBase$class.within(TestKit.scala:296)
          	at akka.testkit.TestKit.within(TestKit.scala:718)
          	at akka.testkit.TestKitBase$class.within(TestKit.scala:310)
          	at akka.testkit.TestKit.within(TestKit.scala:718)
          	at org.apache.flink.runtime.jobmanager.JobManagerITCase$$anonfun$1$$anonfun$apply$mcV$sp$34.apply$mcV$sp(JobManagerITCase.scala:764)
          	at org.apache.flink.runtime.jobmanager.JobManagerITCase$$anonfun$1$$anonfun$apply$mcV$sp$34.apply(JobManagerITCase.scala:758)
          	at org.apache.flink.runtime.jobmanager.JobManagerITCase$$anonfun$1$$anonfun$apply$mcV$sp$34.apply(JobManagerITCase.scala:758)
          	at org.scalatest.Transformer$$anonfun$apply$1.apply$mcV$sp(Transformer.scala:22)
          	at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85)
          	at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
          	at org.scalatest.Transformer.apply(Transformer.scala:22)
          	at org.scalatest.Transformer.apply(Transformer.scala:20)
          	at org.scalatest.WordSpecLike$$anon$1.apply(WordSpecLike.scala:953)
          	at org.scalatest.Suite$class.withFixture(Suite.scala:1122)
          	at org.apache.flink.runtime.jobmanager.JobManagerITCase.withFixture(JobManagerITCase.scala:50)
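For scale, the timeout value in the assertion message above can be converted to seconds and compared with the 300s watchdog limit quoted in the description. The Scala sketch below does only that arithmetic; the object and value names are illustrative and do not come from JobManagerITCase.

```scala
import scala.concurrent.duration._

// Illustrative arithmetic only; these names are not taken from JobManagerITCase.
object TimeoutCheck {
  // Timeout reported in the AssertionError above, in nanoseconds.
  val observed: FiniteDuration = 1199213200030L.nanos
  // Maven watchdog limit quoted in the issue description.
  val watchdog: FiniteDuration = 300.seconds

  def main(args: Array[String]): Unit = {
    // Whole seconds, truncated.
    println(s"observed timeout in whole seconds: ${observed.toSeconds}")
    println(s"exceeds watchdog limit: ${observed > watchdog}")
  }
}
```

The reported wait is roughly 20 minutes, well above the 300s limit, which would be consistent with the watchdog killing the Travis build before the test can fail on its own.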
          
          
          till.rohrmann Till Rohrmann added a comment -

          I think the problem is that in the "handle trigger savepoint response for non-existing job" test, we retrieve the leader gateway but do not wait until the JobManager has gained leadership. This is possible when using the standalone leader retrieval service. As a consequence, we can end up sending the TriggerSavepoint message too early, i.e. before the JobManager has gained leadership, in which case the message is dropped.
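The race described here can be modelled without Akka or Flink. The following is a minimal sketch under stated assumptions: `ToyJobManager`, `grantLeadership`, `leaderFuture` and `tell` are hypothetical stand-ins, not Flink's or Akka's API, and the reply string merely imitates a TriggerSavepointFailure. It shows a message being discarded when sent before leadership is granted, and being answered when the sender first waits for leadership.

```scala
import scala.concurrent.{Await, Future, Promise}
import scala.concurrent.duration._

// Hypothetical stand-in for a JobManager: it discards every message that
// arrives before leadership has been granted. Not Flink's actual API.
final class ToyJobManager {
  private val leadership = Promise[Unit]()
  @volatile private var leader = false

  def grantLeadership(): Unit = {
    leader = true
    leadership.trySuccess(())
  }

  // Completes once leadership has been granted.
  def leaderFuture: Future[Unit] = leadership.future

  // Some(reply) if the message was handled, None if it was discarded.
  def tell(msg: String): Option[String] =
    if (leader) Some(s"TriggerSavepointFailure for $msg") else None
}

object LeadershipRace {
  def main(args: Array[String]): Unit = {
    // Racy variant: the message is sent before leadership is granted.
    val racy = new ToyJobManager
    val dropped = racy.tell("TriggerSavepoint")
    racy.grantLeadership()
    println(s"racy reply: $dropped") // prints "racy reply: None"

    // Hardened variant: block until leadership is confirmed, then send.
    val hardened = new ToyJobManager
    new Thread(new Runnable {
      def run(): Unit = hardened.grantLeadership()
    }).start()
    Await.result(hardened.leaderFuture, 10.seconds)
    println(s"hardened reply: ${hardened.tell("TriggerSavepoint")}")
  }
}
```

In the racy variant no reply is ever produced, so a caller that waits for one can only run into its timeout; waiting on the leadership signal first removes that window.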

          githubbot ASF GitHub Bot added a comment -

          GitHub user tillrohrmann opened a pull request:

          https://github.com/apache/flink/pull/3796

          FLINK-6293 [tests] Harden JobManagerITCase

          One of the unit tests in JobManagerITCase starts a MiniCluster and sends a
          LeaderSessionMessage to the JobManager without waiting until the JobManager
          has gained leadership. This can lead to a dropped TriggerSavepoint message
          which will cause the test to deadlock.

          This PR fixes the problem by explicitly waiting for the JobManager to become
          the leader.

          You can merge this pull request into a Git repository by running:

          $ git pull https://github.com/tillrohrmann/flink fixJobManagerITCase

          Alternatively you can review and apply these changes as the patch at:

          https://github.com/apache/flink/pull/3796.patch

          To close this pull request, make a commit to your master/trunk branch
          with (at least) the following in the commit message:

          This closes #3796


          commit 5abf141c489154f1fc5650a27b0eb19dbaa29e75
          Author: Till Rohrmann <trohrmann@apache.org>
          Date: 2017-04-28T08:04:57Z

          FLINK-6293 [tests] Harden JobManagerITCase


          githubbot ASF GitHub Bot added a comment -

          GitHub user uce commented on the issue:

          https://github.com/apache/flink/pull/3796

          Good fix. +1 to merge.

          githubbot ASF GitHub Bot added a comment -

          GitHub user tillrohrmann commented on the issue:

          https://github.com/apache/flink/pull/3796

          Thanks for the review @uce. Merging this PR.

          till.rohrmann Till Rohrmann added a comment -

          Fixed via f3da8f69e99be49068ab4ea3abc5e1c4eba7bf32

          githubbot ASF GitHub Bot added a comment -

          GitHub user asfgit closed the pull request at:

          https://github.com/apache/flink/pull/3796


            People

            • Assignee: till.rohrmann Till Rohrmann
            • Reporter: NicoK Nico Kruber