      Description

      Very rarely, JobManagerITCase seems to hang; see, for example, https://api.travis-ci.org/jobs/220888193/log.txt?deansi=true

      The Maven watchdog kills the build because no output is produced within 300s, and JobManagerITCase seems to hang at line 772, i.e.

      JobManagerITCase lines 770-772
      // Trigger savepoint for non-existing job
      jobManager.tell(TriggerSavepoint(jobId, Option.apply("any")), testActor)
      val response = expectMsgType[TriggerSavepointFailure](deadline.timeLeft)
      

      Although the (downloaded) logs do not quite allow a precise mapping to this test case, it looks as if the following block may be related:

      09:34:47,684 INFO  org.apache.flink.runtime.minicluster.FlinkMiniCluster         - Akka ask timeout set to 100s
      09:34:47,777 INFO  org.apache.flink.runtime.minicluster.FlinkMiniCluster         - Disabled queryable state server
      09:34:47,777 INFO  org.apache.flink.runtime.minicluster.FlinkMiniCluster         - Starting FlinkMiniCluster.
      09:34:47,809 INFO  akka.event.slf4j.Slf4jLogger                                  - Slf4jLogger started
      09:34:47,837 INFO  org.apache.flink.runtime.blob.BlobServer                      - Created BLOB server storage directory /tmp/blobStore-eab23d04-ea18-4dc5-b1df-fcf9fc295062
      09:34:47,838 WARN  org.apache.flink.runtime.net.SSLUtils                         - Not a SSL socket, will skip setting tls version and cipher suites.
      09:34:47,839 INFO  org.apache.flink.runtime.blob.BlobServer                      - Started BLOB server at 0.0.0.0:36745 - max concurrent requests: 50 - max backlog: 1000
      09:34:47,840 INFO  org.apache.flink.runtime.metrics.MetricRegistry               - No metrics reporter configured, no metrics will be exposed/reported.
      09:34:47,850 INFO  org.apache.flink.runtime.testingUtils.TestingMemoryArchivist  - Started memory archivist akka://flink/user/archive_1
      09:34:47,860 INFO  org.apache.flink.runtime.testutils.TestingResourceManager     - Trying to associate with JobManager leader akka://flink/user/jobmanager_1
      09:34:47,861 INFO  org.apache.flink.runtime.testingUtils.TestingJobManager       - Starting JobManager at akka://flink/user/jobmanager_1.
      09:34:47,862 WARN  org.apache.flink.runtime.testingUtils.TestingJobManager       - Discard message LeaderSessionMessage(00000000-0000-0000-0000-000000000000,TriggerSavepoint(6e813070338a23b0ff571646bca56521,Some(any))) because there is currently no valid leader id known.
      09:34:47,862 INFO  org.apache.flink.runtime.testingUtils.TestingJobManager       - JobManager akka://flink/user/jobmanager_1 was granted leadership with leader session ID Some(00000000-0000-0000-0000-000000000000).
      09:34:47,867 INFO  org.apache.flink.runtime.testutils.TestingResourceManager     - Resource Manager associating with leading JobManager Actor[akka://flink/user/jobmanager_1#-652927556] - leader session 00000000-0000-0000-0000-000000000000
      

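      If so, the TriggerSavepoint message was discarded just before the JobManager was granted leadership, so the test would presumably never receive a reply. This may be related to FLINK-6287 and may even be a duplicate.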
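
      The suspected race in the log excerpt above can be illustrated with a minimal, self-contained sketch. It does not use Flink or Akka: ToyJobManager, SavepointFailure and the 200 ms poll timeout are hypothetical stand-ins chosen for illustration, and the sketch only assumes that a message arriving before a leader id is known is dropped without any reply.

```scala
import java.util.concurrent.{LinkedBlockingQueue, TimeUnit}

// Hypothetical stand-ins for illustration only; these are not Flink classes.
sealed trait Reply
case object SavepointFailure extends Reply

final class ToyJobManager {
  @volatile private var leaderSessionId: Option[String] = None
  val replies = new LinkedBlockingQueue[Reply]()

  def grantLeadership(id: String): Unit = leaderSessionId = Some(id)

  // Assumed behaviour, modelled on the WARN line in the log: without a known
  // leader id the message is dropped and the sender gets no reply at all.
  def tell(msg: String): Unit = leaderSessionId match {
    case Some(_) => replies.put(SavepointFailure)
    case None    => println(s"Discard message $msg because no leader id is known")
  }
}

object LeaderRaceSketch {
  def main(args: Array[String]): Unit = {
    val jm = new ToyJobManager

    // Sent before leadership is granted: discarded.
    jm.tell("TriggerSavepoint")
    jm.grantLeadership("00000000-0000-0000-0000-000000000000")

    // The waiting side polls for a reply that was never produced.
    val early = Option(jm.replies.poll(200, TimeUnit.MILLISECONDS))
    println(s"reply to early message: $early")

    // Sent after leadership is granted: answered.
    jm.tell("TriggerSavepoint")
    val late = Option(jm.replies.poll(200, TimeUnit.MILLISECONDS))
    println(s"reply to late message: $late")
  }
}
```

      In this toy model only the second message is answered, which matches the ordering of the WARN and INFO lines at 09:34:47,862. Whether the real test hits exactly this ordering is not confirmed by the logs.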

      What is strange, though, is that the timeout for the expected message is no more than 2m, so the test should fail properly well within the 300s watchdog limit.
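
      The timeout reasoning can be checked with a small sketch using scala.concurrent.duration. The 2 minute deadline and the 300s watchdog limit are taken from the description above; the actual deadline value in JobManagerITCase is an assumption here.

```scala
import scala.concurrent.duration._

object DeadlineSketch {
  def main(args: Array[String]): Unit = {
    // Assumed values from the description, not read from the test source.
    val watchdogLimit = 300.seconds
    val deadline = 2.minutes.fromNow

    // timeLeft only shrinks as time passes, so it never exceeds the
    // initial 2 minutes and therefore stays below the watchdog limit.
    val remaining = deadline.timeLeft
    println(s"remaining <= 2 minutes: ${remaining <= 2.minutes}")
    println(s"remaining < watchdog limit: ${remaining < watchdogLimit}")
  }
}
```

      A wait bounded by deadline.timeLeft should therefore give up before the watchdog fires, which is why the observed hang is surprising.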

              People

              • Assignee:
                till.rohrmann Till Rohrmann
                Reporter:
                NicoK Nico Kruber