Uploaded image for project: 'Flink'
  1. Flink
  2. FLINK-5482

QueryableStateClient does not recover from a failed lookup due to a non-running job

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 1.2.0
    • Fix Version/s: 1.2.0, 1.3.0
    • Component/s: None
    • Labels:
      None

      Description

      When the QueryableStateClient is used to issue a query but the job is not running yet, its internal lookup result is cached with an IllegalStateException that the job was not found. It does, however, never recover from that.

        Issue Links

          Activity

          Hide
          githubbot ASF GitHub Bot added a comment -

          Github user asfgit closed the pull request at:

          https://github.com/apache/flink/pull/3120

          Show
          githubbot ASF GitHub Bot added a comment - Github user asfgit closed the pull request at: https://github.com/apache/flink/pull/3120
          Hide
          uce Ufuk Celebi added a comment -

          Fixed in 1db8102 (release-1.2), 7ff7f43 (master).

          Show
          uce Ufuk Celebi added a comment - Fixed in 1db8102 (release-1.2), 7ff7f43 (master).
          Hide
          githubbot ASF GitHub Bot added a comment -

          Github user uce commented on the issue:

          https://github.com/apache/flink/pull/3120

          Build failures are unrelated, merging this. Thanks!

          Show
          githubbot ASF GitHub Bot added a comment - Github user uce commented on the issue: https://github.com/apache/flink/pull/3120 Build failures are unrelated, merging this. Thanks!
          Hide
          githubbot ASF GitHub Bot added a comment -

          Github user uce commented on the issue:

          https://github.com/apache/flink/pull/3120

          Very good catch! Thanks. The change looks good to me.

          Show
          githubbot ASF GitHub Bot added a comment - Github user uce commented on the issue: https://github.com/apache/flink/pull/3120 Very good catch! Thanks. The change looks good to me.
          Hide
          githubbot ASF GitHub Bot added a comment -

          Github user uce commented on a diff in the pull request:

          https://github.com/apache/flink/pull/3120#discussion_r96232469

          — Diff: flink-runtime/src/main/java/org/apache/flink/runtime/query/QueryableStateClient.java —
          @@ -341,7 +341,34 @@ public void shutDown()

          { return previous; }

          } else {

          • return cachedFuture;
            + // do not retain futures which failed as they will remain in
            + // the cache even if the error cause is not present any more
            + // and a new lookup may succeed
            + boolean isFailedFuture = false;
            + if (cachedFuture.isCompleted()) {
            + // find out if the future failed
            + try {
            + cachedFuture.value().get().get();
              • End diff –

          I think you can do `cachedFuture.value().get().isFailure()` instead of catching the Exception here.

          Show
          githubbot ASF GitHub Bot added a comment - Github user uce commented on a diff in the pull request: https://github.com/apache/flink/pull/3120#discussion_r96232469 — Diff: flink-runtime/src/main/java/org/apache/flink/runtime/query/QueryableStateClient.java — @@ -341,7 +341,34 @@ public void shutDown() { return previous; } } else { return cachedFuture; + // do not retain futures which failed as they will remain in + // the cache even if the error cause is not present any more + // and a new lookup may succeed + boolean isFailedFuture = false; + if (cachedFuture.isCompleted()) { + // find out if the future failed + try { + cachedFuture.value().get().get(); End diff – I think you can do `cachedFuture.value().get().isFailure()` instead of catching the Exception here.
          Hide
          githubbot ASF GitHub Bot added a comment -

          GitHub user NicoK opened a pull request:

          https://github.com/apache/flink/pull/3120

          FLINK-5482 QueryableStateClient does not recover from a failed lookup due to a non-running job

          This PR checks each cached lookup query whether it is complete and removes any failed lookup from the cache in favour of a retry.

          An appropriate unit test is added based on existing test code.

          You can merge this pull request into a Git repository by running:

          $ git pull https://github.com/NicoK/flink flink-5482

          Alternatively you can review and apply these changes as the patch at:

          https://github.com/apache/flink/pull/3120.patch

          To close this pull request, make a commit to your master/trunk branch
          with (at least) the following in the commit message:

          This closes #3120


          commit 086aca06674618cb3be962a883efa62c77aa1c66
          Author: Nico Kruber <nico@data-artisans.com>
          Date: 2017-01-12T15:41:30Z

          FLINK-5482 share more code in QueryableStateITCase

          commit a50e155cf2a1e7e04b160d5226f16f017509799e
          Author: Nico Kruber <nico@data-artisans.com>
          Date: 2017-01-12T15:48:27Z

          FLINK-5482 fix QueryableStateClient not re-issuing a lookup upon failure

          Any failing lookup, e.g. in case the job has not been started yet, previously
          remained in the lookup cache and thus future queries did not retry the lookup
          and failed. This commit changes the lookup caching code so that completed
          and failed futures are removed from the cache and replaced by new lookups.


          Show
          githubbot ASF GitHub Bot added a comment - GitHub user NicoK opened a pull request: https://github.com/apache/flink/pull/3120 FLINK-5482 QueryableStateClient does not recover from a failed lookup due to a non-running job This PR checks each cached lookup query whether it is complete and removes any failed lookup from the cache in favour of a retry. An appropriate unit test is added based on existing test code. You can merge this pull request into a Git repository by running: $ git pull https://github.com/NicoK/flink flink-5482 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/flink/pull/3120.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #3120 commit 086aca06674618cb3be962a883efa62c77aa1c66 Author: Nico Kruber <nico@data-artisans.com> Date: 2017-01-12T15:41:30Z FLINK-5482 share more code in QueryableStateITCase commit a50e155cf2a1e7e04b160d5226f16f017509799e Author: Nico Kruber <nico@data-artisans.com> Date: 2017-01-12T15:48:27Z FLINK-5482 fix QueryableStateClient not re-issuing a lookup upon failure Any failing lookup, e.g. in case the job has not been started yet, previously remained in the lookup cache and thus future queries did not retry the lookup and failed. This commit changes the lookup caching code so that completed and failed futures are removed from the cache and replaced by new lookups.

            People

            • Assignee:
              NicoK Nico Kruber
              Reporter:
              NicoK Nico Kruber
            • Votes:
              0 Vote for this issue
              Watchers:
              3 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development