Flink / FLINK-4619

JobManager does not answer to client when restore from savepoint fails

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 1.1.0, 1.1.1, 1.1.2
    • Fix Version/s: 1.2.0, 1.1.4
    • Component/s: None
    • Labels:
      None

      Description

      When the savepoint used is incompatible with the currently deployed process, the JobManager never returns a response to the client (jobInfo.notifyClients is not invoked in one of the try-catch blocks).
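The failure mode described above can be illustrated with a small, self-contained sketch. This is not Flink's JobManager code: only the names `jobInfo` and `notifyClients` come from the description, while `SavepointRestoreSketch`, `JobInfo`, `restoreFromSavepoint`, `submitBuggy`, `submitFixed`, and `run` are hypothetical names invented for illustration. It assumes a simplified model in which clients register a callback and wait for a message.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

public class SavepointRestoreSketch {

    /** Hypothetical stand-in for the per-job bookkeeping that holds the waiting clients. */
    static final class JobInfo {
        private final List<Consumer<String>> clients = new ArrayList<>();

        void addClient(Consumer<String> client) {
            clients.add(client);
        }

        void notifyClients(String message) {
            for (Consumer<String> client : clients) {
                client.accept(message);
            }
        }
    }

    /** Hypothetical restore step: throws when the savepoint does not match the job. */
    static void restoreFromSavepoint(boolean compatible) {
        if (!compatible) {
            throw new IllegalStateException("savepoint is incompatible with the job");
        }
    }

    /** Buggy shape: the catch block handles the error locally and never answers the client. */
    static void submitBuggy(JobInfo jobInfo, boolean compatible) {
        try {
            restoreFromSavepoint(compatible);
            jobInfo.notifyClients("SUCCESS");
        } catch (IllegalStateException e) {
            // Failure is swallowed here; notifyClients is never called,
            // so a waiting client would block until its own timeout.
        }
    }

    /** Fixed shape: every exit path of the try-catch notifies the client. */
    static void submitFixed(JobInfo jobInfo, boolean compatible) {
        try {
            restoreFromSavepoint(compatible);
            jobInfo.notifyClients("SUCCESS");
        } catch (IllegalStateException e) {
            jobInfo.notifyClients("FAILURE: " + e.getMessage());
        }
    }

    /** Registers one client, runs a submission, and returns the messages the client received. */
    static List<String> run(boolean fixed, boolean compatible) {
        List<String> received = new ArrayList<>();
        JobInfo jobInfo = new JobInfo();
        jobInfo.addClient(received::add);
        if (fixed) {
            submitFixed(jobInfo, compatible);
        } else {
            submitBuggy(jobInfo, compatible);
        }
        return received;
    }

    public static void main(String[] args) {
        System.out.println("buggy: " + run(false, false)); // prints "buggy: []"
        System.out.println("fixed: " + run(true, false));  // prints "fixed: [FAILURE: savepoint is incompatible with the job]"
    }
}
```

In the buggy variant the client receives nothing at all, which matches the "never returns" symptom. In the fixed variant the client receives the failure together with its cause.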

          Activity

          tzulitai Tzu-Li (Gordon) Tai added a comment -

          +1 to fix. I encountered this as well. From the client, it would be much more helpful to have information about the cause.

          mproch Maciej Prochniak added a comment -

          https://github.com/mproch/flink/commit/ceef33d94058958f36275bfae81d00054e8cc231 - I think this should do the job, but I have yet to find a place to write a test...

          tzulitai Tzu-Li (Gordon) Tai added a comment -

          I think org.apache.flink.runtime.client.JobClientActorTest in the Java tests of flink-runtime might be a good place to start looking.
          Feel free to open a PR when you think you're ready; I can help review it.

          githubbot ASF GitHub Bot added a comment -

          GitHub user mproch opened a pull request:

          https://github.com/apache/flink/pull/2498

          FLINK-4619 - JobManager does not answer to client when restore from savepoint fails

          You can merge this pull request into a Git repository by running:

          $ git pull https://github.com/mproch/flink flink-4619

          Alternatively you can review and apply these changes as the patch at:

          https://github.com/apache/flink/pull/2498.patch

          To close this pull request, make a commit to your master/trunk branch
          with (at least) the following in the commit message:

          This closes #2498


          commit 79ed80af927ecededd06ba61d98879b189f064ea
          Author: Maciek Próchniak <mpr@touk.pl>
          Date: 2016-09-14T12:27:27Z

          FLINK-4619 - JobManager does not answer to client when restore from savepoint fails


          githubbot ASF GitHub Bot added a comment -

          Github user StephanEwen commented on the issue:

          https://github.com/apache/flink/pull/2498

          Good idea. Unfortunately, the change broke some tests:

          • `SavepointITCase.testSubmitWithUnknownSavepointPath`
          • `RescalingITCase.testSavepointRescalingFailureWithNonPartitionedState`

          You can see more in the Travis CI report.

          githubbot ASF GitHub Bot added a comment -

          Github user mproch commented on the issue:

          https://github.com/apache/flink/pull/2498

          Yes, it's a bit more tricky than I expected... These tests probably also need to be changed, because they relied on the previous behaviour... What's more, with the change some checks on FlinkMiniCluster become non-deterministic... I'll commit fixes once I fully understand how it works...

          githubbot ASF GitHub Bot added a comment -

          Github user mproch commented on the issue:

          https://github.com/apache/flink/pull/2498

          Hmm... I managed to make the Travis build pass, but I cannot see how the Jenkins test failures relate to my change...

          githubbot ASF GitHub Bot added a comment -

          Github user mproch commented on the issue:

          https://github.com/apache/flink/pull/2498

          @StephanEwen can I retrigger this or whatever?

          githubbot ASF GitHub Bot added a comment -

          Github user StephanEwen commented on the issue:

          https://github.com/apache/flink/pull/2498

          No need to re-trigger this. Travis green light is good.

          @uce - Do you think this works well with your latest changes?

          githubbot ASF GitHub Bot added a comment -

          Github user uce commented on the issue:

          https://github.com/apache/flink/pull/2498

          Yes, I'll merge this later today.

          githubbot ASF GitHub Bot added a comment -

          Github user asfgit closed the pull request at:

          https://github.com/apache/flink/pull/2498

          uce Ufuk Celebi added a comment -

          Fixed in b05c3c1 (master), 6624348 (release-1.1).


            People

            • Assignee:
              Unassigned
              Reporter:
              mproch Maciej Prochniak
            • Votes:
              1
              Watchers:
              4