Uploaded image for project: 'Flink'
  1. Flink
  2. FLINK-5942

Harden ZooKeeperStateHandleStore to deal with corrupted data

    Details

      Description

      The ZooKeeperStateHandleStore cannot handle corrupted Znode data. When calling ZooKeeperStateHandleStore.getAll or getAllSortedByName and reading a node with corrupted data, the whole operation will fail. In such a situation, Flink won't be able to recover because it will read over and over again the same corrupted Znodes (in the recovery case). Therefore, I propose to ignore Znodes whose data cannot be read.

        Issue Links

          Activity

          Hide
          githubbot ASF GitHub Bot added a comment -

          GitHub user tillrohrmann opened a pull request:

          https://github.com/apache/flink/pull/3447

          FLINK-5942 [checkpoint] Harden ZooKeeperStateHandleStore to handle corrupt data

          If calling ZooKeeperStateHandleStore.getAll or getAllSortedByName as the
          ZooKeeperCompletedCheckpointStore does in the recovery case, the operation will fail
          if there exists a Znode with corrupted data. This will break Flink's recovery
          mechanism, because it will read this node over and over again. In order to solve this
          problem, this commit changes the behaviour such that it ignores corrupted Znodes it
          cannot read.

          You can merge this pull request into a Git repository by running:

          $ git pull https://github.com/tillrohrmann/flink hardenZooKeeperStateHandleStore

          Alternatively you can review and apply these changes as the patch at:

          https://github.com/apache/flink/pull/3447.patch

          To close this pull request, make a commit to your master/trunk branch
          with (at least) the following in the commit message:

          This closes #3447


          commit 94177368694ccd007379a54644443c58912e0650
          Author: Till Rohrmann <trohrmann@apache.org>
          Date: 2017-03-01T17:03:41Z

          FLINK-5942 [checkpoint] Harden ZooKeeperStateHandleStore to handle corrupt data

          If calling ZooKeeperStateHandleStore.getAll or getAllSortedByName as the
          ZooKeeperCompletedCheckpointStore does in the recovery case, the operation will fail
          if there exists a Znode with corrupted data. This will break Flink's recovery
          mechanism, because it will read this node over and over again. In order to solve this
          problem, this commit changes the behaviour such that it ignores corrupted Znodes it
          cannot read.


          Show
          githubbot ASF GitHub Bot added a comment - GitHub user tillrohrmann opened a pull request: https://github.com/apache/flink/pull/3447 FLINK-5942 [checkpoint] Harden ZooKeeperStateHandleStore to handle corrupt data If calling ZooKeeperStateHandleStore.getAll or getAllSortedByName as the ZooKeeperCompletedCheckpointStore does in the recovery case, the operation will fail if there exists a Znode with corrupted data. This will break Flink's recovery mechanism, because it will read this node over and over again. In order to solve this problem, this commit changes the behaviour such that it ignores corrupted Znodes it cannot read. You can merge this pull request into a Git repository by running: $ git pull https://github.com/tillrohrmann/flink hardenZooKeeperStateHandleStore Alternatively you can review and apply these changes as the patch at: https://github.com/apache/flink/pull/3447.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #3447 commit 94177368694ccd007379a54644443c58912e0650 Author: Till Rohrmann <trohrmann@apache.org> Date: 2017-03-01T17:03:41Z FLINK-5942 [checkpoint] Harden ZooKeeperStateHandleStore to handle corrupt data If calling ZooKeeperStateHandleStore.getAll or getAllSortedByName as the ZooKeeperCompletedCheckpointStore does in the recovery case, the operation will fail if there exists a Znode with corrupted data. This will break Flink's recovery mechanism, because it will read this node over and over again. In order to solve this problem, this commit changes the behaviour such that it ignores corrupted Znodes it cannot read.
          Hide
          githubbot ASF GitHub Bot added a comment -

          GitHub user tillrohrmann opened a pull request:

          https://github.com/apache/flink/pull/3449

          [backport-1.2] FLINK-5942 [checkpoint] Harden ZooKeeperStateHandleStore to handle corrupt data

          Backport of #3446 onto the `release-1.2` branch.

          If calling ZooKeeperStateHandleStore.getAll or getAllSortedByName as the
          ZooKeeperCompletedCheckpointStore does in the recovery case, the operation will fail
          if there exists a Znode with corrupted data. This will break Flink's recovery
          mechanism, because it will read this node over and over again. In order to solve this
          problem, this commit changes the behaviour such that it ignores corrupted Znodes it
          cannot read.

          You can merge this pull request into a Git repository by running:

          $ git pull https://github.com/tillrohrmann/flink hardenZooKeeperStateHandleStoreBp1.2

          Alternatively you can review and apply these changes as the patch at:

          https://github.com/apache/flink/pull/3449.patch

          To close this pull request, make a commit to your master/trunk branch
          with (at least) the following in the commit message:

          This closes #3449


          commit fe619e224502b83c08ceb0ee3120f7a66a0101d7
          Author: Till Rohrmann <trohrmann@apache.org>
          Date: 2017-03-01T17:03:41Z

          FLINK-5942 [checkpoint] Harden ZooKeeperStateHandleStore to handle corrupt data

          If calling ZooKeeperStateHandleStore.getAll or getAllSortedByName as the
          ZooKeeperCompletedCheckpointStore does in the recovery case, the operation will fail
          if there exists a Znode with corrupted data. This will break Flink's recovery
          mechanism, because it will read this node over and over again. In order to solve this
          problem, this commit changes the behaviour such that it ignores corrupted Znodes it
          cannot read.


          Show
          githubbot ASF GitHub Bot added a comment - GitHub user tillrohrmann opened a pull request: https://github.com/apache/flink/pull/3449 [backport-1.2] FLINK-5942 [checkpoint] Harden ZooKeeperStateHandleStore to handle corrupt data Backport of #3446 onto the `release-1.2` branch. If calling ZooKeeperStateHandleStore.getAll or getAllSortedByName as the ZooKeeperCompletedCheckpointStore does in the recovery case, the operation will fail if there exists a Znode with corrupted data. This will break Flink's recovery mechanism, because it will read this node over and over again. In order to solve this problem, this commit changes the behaviour such that it ignores corrupted Znodes it cannot read. You can merge this pull request into a Git repository by running: $ git pull https://github.com/tillrohrmann/flink hardenZooKeeperStateHandleStoreBp1.2 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/flink/pull/3449.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #3449 commit fe619e224502b83c08ceb0ee3120f7a66a0101d7 Author: Till Rohrmann <trohrmann@apache.org> Date: 2017-03-01T17:03:41Z FLINK-5942 [checkpoint] Harden ZooKeeperStateHandleStore to handle corrupt data If calling ZooKeeperStateHandleStore.getAll or getAllSortedByName as the ZooKeeperCompletedCheckpointStore does in the recovery case, the operation will fail if there exists a Znode with corrupted data. This will break Flink's recovery mechanism, because it will read this node over and over again. In order to solve this problem, this commit changes the behaviour such that it ignores corrupted Znodes it cannot read.
          Hide
          githubbot ASF GitHub Bot added a comment -

          GitHub user tillrohrmann opened a pull request:

          https://github.com/apache/flink/pull/3452

          [backport-1.1] FLINK-5942 [checkpoint] Harden ZooKeeperStateHandleStore to handle corrupt data

          Backport of #3447 onto `release-1.1` branch.

          If calling ZooKeeperStateHandleStore.getAll or getAllSortedByName as the
          ZooKeeperCompletedCheckpointStore does in the recovery case, the operation will fail
          if there exists a Znode with corrupted data. This will break Flink's recovery
          mechanism, because it will read this node over and over again. In order to solve this
          problem, this commit changes the behaviour such that it ignores corrupted Znodes it
          cannot read.

          You can merge this pull request into a Git repository by running:

          $ git pull https://github.com/tillrohrmann/flink hardenZooKeeperStateHandleStoreBp1.1

          Alternatively you can review and apply these changes as the patch at:

          https://github.com/apache/flink/pull/3452.patch

          To close this pull request, make a commit to your master/trunk branch
          with (at least) the following in the commit message:

          This closes #3452



          Show
          githubbot ASF GitHub Bot added a comment - GitHub user tillrohrmann opened a pull request: https://github.com/apache/flink/pull/3452 [backport-1.1] FLINK-5942 [checkpoint] Harden ZooKeeperStateHandleStore to handle corrupt data Backport of #3447 onto `release-1.1` branch. If calling ZooKeeperStateHandleStore.getAll or getAllSortedByName as the ZooKeeperCompletedCheckpointStore does in the recovery case, the operation will fail if there exists a Znode with corrupted data. This will break Flink's recovery mechanism, because it will read this node over and over again. In order to solve this problem, this commit changes the behaviour such that it ignores corrupted Znodes it cannot read. You can merge this pull request into a Git repository by running: $ git pull https://github.com/tillrohrmann/flink hardenZooKeeperStateHandleStoreBp1.1 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/flink/pull/3452.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #3452
          Hide
          githubbot ASF GitHub Bot added a comment -

          Github user StephanEwen commented on a diff in the pull request:

          https://github.com/apache/flink/pull/3447#discussion_r105024493

          — Diff: flink-runtime/src/main/java/org/apache/flink/runtime/zookeeper/ZooKeeperStateHandleStore.java —
          @@ -240,7 +241,7 @@ public int exists(String pathInZooKeeper) throws Exception {
          try

          { return InstantiationUtil.deserializeObject(data, Thread.currentThread().getContextClassLoader()); }

          catch (IOException | ClassNotFoundException e) {

          • throw new Exception("Failed to deserialize state handle from ZooKeeper data from " +
            + throw new FlinkIOException("Failed to deserialize state handle from ZooKeeper data from " +
              • End diff –

          Why not throw a regular I/O exception here?

          Show
          githubbot ASF GitHub Bot added a comment - Github user StephanEwen commented on a diff in the pull request: https://github.com/apache/flink/pull/3447#discussion_r105024493 — Diff: flink-runtime/src/main/java/org/apache/flink/runtime/zookeeper/ZooKeeperStateHandleStore.java — @@ -240,7 +241,7 @@ public int exists(String pathInZooKeeper) throws Exception { try { return InstantiationUtil.deserializeObject(data, Thread.currentThread().getContextClassLoader()); } catch (IOException | ClassNotFoundException e) { throw new Exception("Failed to deserialize state handle from ZooKeeper data from " + + throw new FlinkIOException("Failed to deserialize state handle from ZooKeeper data from " + End diff – Why not throw a regular I/O exception here?
          Hide
          githubbot ASF GitHub Bot added a comment -

          Github user tillrohrmann commented on a diff in the pull request:

          https://github.com/apache/flink/pull/3447#discussion_r105416999

          — Diff: flink-runtime/src/main/java/org/apache/flink/runtime/zookeeper/ZooKeeperStateHandleStore.java —
          @@ -240,7 +241,7 @@ public int exists(String pathInZooKeeper) throws Exception {
          try

          { return InstantiationUtil.deserializeObject(data, Thread.currentThread().getContextClassLoader()); }

          catch (IOException | ClassNotFoundException e) {

          • throw new Exception("Failed to deserialize state handle from ZooKeeper data from " +
            + throw new FlinkIOException("Failed to deserialize state handle from ZooKeeper data from " +
              • End diff –

          The idea was to switch slowly to the new `FlinkExceptions`. However, what I just realize is that `FlinkIOException` does not inherit from `IOException` which is not good if you want to catch all `IOException`. Will revert it back to `IOException`.

          Show
          githubbot ASF GitHub Bot added a comment - Github user tillrohrmann commented on a diff in the pull request: https://github.com/apache/flink/pull/3447#discussion_r105416999 — Diff: flink-runtime/src/main/java/org/apache/flink/runtime/zookeeper/ZooKeeperStateHandleStore.java — @@ -240,7 +241,7 @@ public int exists(String pathInZooKeeper) throws Exception { try { return InstantiationUtil.deserializeObject(data, Thread.currentThread().getContextClassLoader()); } catch (IOException | ClassNotFoundException e) { throw new Exception("Failed to deserialize state handle from ZooKeeper data from " + + throw new FlinkIOException("Failed to deserialize state handle from ZooKeeper data from " + End diff – The idea was to switch slowly to the new `FlinkExceptions`. However, what I just realize is that `FlinkIOException` does not inherit from `IOException` which is not good if you want to catch all `IOException`. Will revert it back to `IOException`.
          Hide
          githubbot ASF GitHub Bot added a comment -

          Github user tillrohrmann commented on the issue:

          https://github.com/apache/flink/pull/3447

          Thanks for the review @StephanEwen. Merging this PR.

          Show
          githubbot ASF GitHub Bot added a comment - Github user tillrohrmann commented on the issue: https://github.com/apache/flink/pull/3447 Thanks for the review @StephanEwen. Merging this PR.
          Hide
          githubbot ASF GitHub Bot added a comment -

          Github user asfgit closed the pull request at:

          https://github.com/apache/flink/pull/3447

          Show
          githubbot ASF GitHub Bot added a comment - Github user asfgit closed the pull request at: https://github.com/apache/flink/pull/3447
          Hide
          githubbot ASF GitHub Bot added a comment -

          Github user tillrohrmann commented on the issue:

          https://github.com/apache/flink/pull/3449

          Merging this PR.

          Show
          githubbot ASF GitHub Bot added a comment - Github user tillrohrmann commented on the issue: https://github.com/apache/flink/pull/3449 Merging this PR.
          Hide
          githubbot ASF GitHub Bot added a comment -

          Github user tillrohrmann closed the pull request at:

          https://github.com/apache/flink/pull/3449

          Show
          githubbot ASF GitHub Bot added a comment - Github user tillrohrmann closed the pull request at: https://github.com/apache/flink/pull/3449
          Hide
          githubbot ASF GitHub Bot added a comment -

          Github user tillrohrmann commented on the issue:

          https://github.com/apache/flink/pull/3452

          Merging this PR.

          Show
          githubbot ASF GitHub Bot added a comment - Github user tillrohrmann commented on the issue: https://github.com/apache/flink/pull/3452 Merging this PR.
          Hide
          githubbot ASF GitHub Bot added a comment -

          Github user tillrohrmann closed the pull request at:

          https://github.com/apache/flink/pull/3452

          Show
          githubbot ASF GitHub Bot added a comment - Github user tillrohrmann closed the pull request at: https://github.com/apache/flink/pull/3452
          Hide
          till.rohrmann Till Rohrmann added a comment -

          1.3.0: ffb8d0518394847ac60252f2140d50a4fd1f0b68
          1.2.1: 39201cf9a5149f5ed14d53a6ecdbd61e23d5eac9
          1.1.5: 97616fd3fbad07cb5793a3dfd114515f9e520464

          Show
          till.rohrmann Till Rohrmann added a comment - 1.3.0: ffb8d0518394847ac60252f2140d50a4fd1f0b68 1.2.1: 39201cf9a5149f5ed14d53a6ecdbd61e23d5eac9 1.1.5: 97616fd3fbad07cb5793a3dfd114515f9e520464

            People

            • Assignee:
              till.rohrmann Till Rohrmann
              Reporter:
              till.rohrmann Till Rohrmann
            • Votes:
              0 Vote for this issue
              Watchers:
              2 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development