Uploaded image for project: 'Flink'
  1. Flink
  2. FLINK-6284

Incorrect sorting of completed checkpoints in ZooKeeperCompletedCheckpointStore

    Details

      Description

      Now all completed checkpoints are sorted in their paths when they are recovered in ZooKeeperCompletedCheckpointStore . In the cases where the latest checkpoint's id is not the largest in lexical order (e.g., "100" is smaller than "99" in lexical order), Flink will not recover from the latest completed checkpoint.

      The problem can be easily observed by setting the checkpoint ids in ZooKeeperCompletedCheckpointStoreITCase#testRecover() to be 99, 100 and 101.

      To fix the problem, we should explicitly sort found checkpoints in their checkpoint ids, without the usage of ZooKeeperStateHandleStore#getAllSortedByName()

        Issue Links

          Activity

          Hide
          StephanEwen Stephan Ewen added a comment -

          Thanks for opening that issue. That is critical indeed.
          Adding Till Rohrmann to this conversation.

          Show
          StephanEwen Stephan Ewen added a comment - Thanks for opening that issue. That is critical indeed. Adding Till Rohrmann to this conversation.
          Hide
          till.rohrmann Till Rohrmann added a comment -

          +1 for sorting wrt checkpoint ids or Znode version number.

          Show
          till.rohrmann Till Rohrmann added a comment - +1 for sorting wrt checkpoint ids or Znode version number.
          Hide
          ram_krish ramkrishna.s.vasudevan added a comment -

          Can I take this up, if some one is not already working on this?

          Show
          ram_krish ramkrishna.s.vasudevan added a comment - Can I take this up, if some one is not already working on this?
          Hide
          rmetzger Robert Metzger added a comment -

          I think nobody is working on the issue right now.
          How fast would you be able to provide a pull request to fix this?
          I'm asking because this is one of the last blockers of the 1.3.0 release, and I would like to create the next release candidate on Monday morning (CET).

          Show
          rmetzger Robert Metzger added a comment - I think nobody is working on the issue right now. How fast would you be able to provide a pull request to fix this? I'm asking because this is one of the last blockers of the 1.3.0 release, and I would like to create the next release candidate on Monday morning (CET).
          Hide
          ram_krish ramkrishna.s.vasudevan added a comment -

          I can try to do this by end of today IST.

          Show
          ram_krish ramkrishna.s.vasudevan added a comment - I can try to do this by end of today IST.
          Hide
          githubbot ASF GitHub Bot added a comment -

          GitHub user ramkrish86 opened a pull request:

          https://github.com/apache/flink/pull/3881

          FLINK-6284 Incorrect sorting of completed checkpoints in ZooKeeperCompletedCheckpointStore

          ZooKeeperCompletedCheckpointStore

          Thanks for contributing to Apache Flink. Before you open your pull request, please take the following check list into consideration.
          If your changes take all of the items into account, feel free to open your pull request. For more information and/or questions please refer to the [How To Contribute guide](http://flink.apache.org/how-to-contribute.html).
          In addition to going through the list, please provide a meaningful description of your changes.

          • [ ] General
          • The pull request references the related JIRA issue ("[FLINK-XXX] Jira title text")
          • The pull request addresses only one issue
          • Each commit in the PR has a meaningful commit message (including the JIRA id)
          • [ ] Documentation
          • Documentation has been added for new functionality
          • Old documentation affected by the pull request has been updated
          • JavaDoc for public methods has been added
          • [ ] Tests & Build
          • Functionality added by the pull request is covered by tests
          • `mvn clean verify` has been executed successfully locally or a Travis build has passed

          Making use of the Zookeeper's getChildren() API directly so that it just creates a list in the sequence order. If we go with the ZKPaths API then we need to do some sorting by converting the List<STring> to List<Long>.

          You can merge this pull request into a Git repository by running:

          $ git pull https://github.com/ramkrish86/flink FLINK-6284

          Alternatively you can review and apply these changes as the patch at:

          https://github.com/apache/flink/pull/3881.patch

          To close this pull request, make a commit to your master/trunk branch
          with (at least) the following in the commit message:

          This closes #3881


          commit 33bf37a2d706af6c8eb6cbe9d58aa3ac9d1f03e0
          Author: Ramkrishna <ramkrishna.s.vasudevan@intel.com>
          Date: 2017-05-12T08:18:16Z

          FLINK-6284 Incorrect sorting of completed checkpoints in
          ZooKeeperCompletedCheckpointStore


          Show
          githubbot ASF GitHub Bot added a comment - GitHub user ramkrish86 opened a pull request: https://github.com/apache/flink/pull/3881 FLINK-6284 Incorrect sorting of completed checkpoints in ZooKeeperCompletedCheckpointStore ZooKeeperCompletedCheckpointStore Thanks for contributing to Apache Flink. Before you open your pull request, please take the following check list into consideration. If your changes take all of the items into account, feel free to open your pull request. For more information and/or questions please refer to the [How To Contribute guide] ( http://flink.apache.org/how-to-contribute.html ). In addition to going through the list, please provide a meaningful description of your changes. [ ] General The pull request references the related JIRA issue (" [FLINK-XXX] Jira title text") The pull request addresses only one issue Each commit in the PR has a meaningful commit message (including the JIRA id) [ ] Documentation Documentation has been added for new functionality Old documentation affected by the pull request has been updated JavaDoc for public methods has been added [ ] Tests & Build Functionality added by the pull request is covered by tests `mvn clean verify` has been executed successfully locally or a Travis build has passed Making use of the Zookeeper's getChildren() API directly so that it just creates a list in the sequence order. If we go with the ZKPaths API then we need to do some sorting by converting the List<STring> to List<Long>. You can merge this pull request into a Git repository by running: $ git pull https://github.com/ramkrish86/flink FLINK-6284 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/flink/pull/3881.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #3881 commit 33bf37a2d706af6c8eb6cbe9d58aa3ac9d1f03e0 Author: Ramkrishna <ramkrishna.s.vasudevan@intel.com> Date: 2017-05-12T08:18:16Z FLINK-6284 Incorrect sorting of completed checkpoints in ZooKeeperCompletedCheckpointStore
          Hide
          githubbot ASF GitHub Bot added a comment -

          Github user tillrohrmann commented on a diff in the pull request:

          https://github.com/apache/flink/pull/3881#discussion_r116185935

          — Diff: flink-runtime/src/main/java/org/apache/flink/runtime/zookeeper/ZooKeeperStateHandleStore.java —
          @@ -346,11 +346,7 @@ public int exists(String pathInZooKeeper) throws Exception {
          } else {
          // Initial cVersion (number of changes to the children of this node)
          int initialCVersion = stat.getCversion();
          -

          • List<String> children = ZKPaths.getSortedChildren(
          • client.getZookeeperClient().getZooKeeper(),
          • ZKPaths.fixForNamespace(client.getNamespace(), "/"));
            -
            + List<String> children = client.getZookeeperClient().getZooKeeper().getChildren(ZKPaths.fixForNamespace(client.getNamespace(), "/"), false);
              • End diff –

          I think this alone does not work: The JavaDocs of `ZooKeeper#getChildren` say

          > The list of children returned is not sorted and no guarantee is provided as to its natural or lexical order.

          Thus, I assume that it is not safe to simply return the list of children without any further processing.

          Show
          githubbot ASF GitHub Bot added a comment - Github user tillrohrmann commented on a diff in the pull request: https://github.com/apache/flink/pull/3881#discussion_r116185935 — Diff: flink-runtime/src/main/java/org/apache/flink/runtime/zookeeper/ZooKeeperStateHandleStore.java — @@ -346,11 +346,7 @@ public int exists(String pathInZooKeeper) throws Exception { } else { // Initial cVersion (number of changes to the children of this node) int initialCVersion = stat.getCversion(); - List<String> children = ZKPaths.getSortedChildren( client.getZookeeperClient().getZooKeeper(), ZKPaths.fixForNamespace(client.getNamespace(), "/")); - + List<String> children = client.getZookeeperClient().getZooKeeper().getChildren(ZKPaths.fixForNamespace(client.getNamespace(), "/"), false); End diff – I think this alone does not work: The JavaDocs of `ZooKeeper#getChildren` say > The list of children returned is not sorted and no guarantee is provided as to its natural or lexical order. Thus, I assume that it is not safe to simply return the list of children without any further processing.
          Hide
          githubbot ASF GitHub Bot added a comment -

          Github user ramkrish86 commented on a diff in the pull request:

          https://github.com/apache/flink/pull/3881#discussion_r116188234

          — Diff: flink-runtime/src/main/java/org/apache/flink/runtime/zookeeper/ZooKeeperStateHandleStore.java —
          @@ -346,11 +346,7 @@ public int exists(String pathInZooKeeper) throws Exception {
          } else {
          // Initial cVersion (number of changes to the children of this node)
          int initialCVersion = stat.getCversion();
          -

          • List<String> children = ZKPaths.getSortedChildren(
          • client.getZookeeperClient().getZooKeeper(),
          • ZKPaths.fixForNamespace(client.getNamespace(), "/"));
            -
            + List<String> children = client.getZookeeperClient().getZooKeeper().getChildren(ZKPaths.fixForNamespace(client.getNamespace(), "/"), false);
              • End diff –

          Let me do it my older way. I had a patch but I thought this is better. I checked the javadoc of the ZKPaths only. I will push my initial version of the patch only then, where convert the List<String> to List<Long> and then use that as the sorted one.

          Show
          githubbot ASF GitHub Bot added a comment - Github user ramkrish86 commented on a diff in the pull request: https://github.com/apache/flink/pull/3881#discussion_r116188234 — Diff: flink-runtime/src/main/java/org/apache/flink/runtime/zookeeper/ZooKeeperStateHandleStore.java — @@ -346,11 +346,7 @@ public int exists(String pathInZooKeeper) throws Exception { } else { // Initial cVersion (number of changes to the children of this node) int initialCVersion = stat.getCversion(); - List<String> children = ZKPaths.getSortedChildren( client.getZookeeperClient().getZooKeeper(), ZKPaths.fixForNamespace(client.getNamespace(), "/")); - + List<String> children = client.getZookeeperClient().getZooKeeper().getChildren(ZKPaths.fixForNamespace(client.getNamespace(), "/"), false); End diff – Let me do it my older way. I had a patch but I thought this is better. I checked the javadoc of the ZKPaths only. I will push my initial version of the patch only then, where convert the List<String> to List<Long> and then use that as the sorted one.
          Hide
          githubbot ASF GitHub Bot added a comment -

          Github user tillrohrmann commented on a diff in the pull request:

          https://github.com/apache/flink/pull/3881#discussion_r116207779

          — Diff: flink-runtime/src/main/java/org/apache/flink/runtime/zookeeper/ZooKeeperStateHandleStore.java —
          @@ -346,17 +346,20 @@ public int exists(String pathInZooKeeper) throws Exception {
          } else {
          // Initial cVersion (number of changes to the children of this node)
          int initialCVersion = stat.getCversion();
          -

          • List<String> children = ZKPaths.getSortedChildren(
          • client.getZookeeperClient().getZooKeeper(),
          • ZKPaths.fixForNamespace(client.getNamespace(), "/"));
            -
          • for (String path : children) {
          • path = "/" + path;
            + List<String> childrenInStr =
            + client.getZookeeperClient().getZooKeeper().
            + getChildren(ZKPaths.fixForNamespace(client.getNamespace(), "/"), false);
            + List<Long> children = new ArrayList<Long>(childrenInStr.size());
            + for(String childNode : childrenInStr) { + children.add(new Long(childNode)); + }
              • End diff –

          Where are the children sorted? Again, I think this only works because `ZooKeeper#getChildren` returns the nodes in the right order in the test case.

          Show
          githubbot ASF GitHub Bot added a comment - Github user tillrohrmann commented on a diff in the pull request: https://github.com/apache/flink/pull/3881#discussion_r116207779 — Diff: flink-runtime/src/main/java/org/apache/flink/runtime/zookeeper/ZooKeeperStateHandleStore.java — @@ -346,17 +346,20 @@ public int exists(String pathInZooKeeper) throws Exception { } else { // Initial cVersion (number of changes to the children of this node) int initialCVersion = stat.getCversion(); - List<String> children = ZKPaths.getSortedChildren( client.getZookeeperClient().getZooKeeper(), ZKPaths.fixForNamespace(client.getNamespace(), "/")); - for (String path : children) { path = "/" + path; + List<String> childrenInStr = + client.getZookeeperClient().getZooKeeper(). + getChildren(ZKPaths.fixForNamespace(client.getNamespace(), "/"), false); + List<Long> children = new ArrayList<Long>(childrenInStr.size()); + for(String childNode : childrenInStr) { + children.add(new Long(childNode)); + } End diff – Where are the children sorted? Again, I think this only works because `ZooKeeper#getChildren` returns the nodes in the right order in the test case.
          Hide
          githubbot ASF GitHub Bot added a comment -

          Github user tillrohrmann commented on a diff in the pull request:

          https://github.com/apache/flink/pull/3881#discussion_r116207072

          — Diff: pom.xml —
          @@ -101,7 +101,8 @@ under the License.
          <chill.version>0.7.4</chill.version>
          <asm.version>5.0.4</asm.version>
          <zookeeper.version>3.4.6</zookeeper.version>

          • <curator.version>2.12.0</curator.version>
            + <curator.version>2.11.0</curator.version>
              • End diff –

          Why are you downgrading the curator version?

          Show
          githubbot ASF GitHub Bot added a comment - Github user tillrohrmann commented on a diff in the pull request: https://github.com/apache/flink/pull/3881#discussion_r116207072 — Diff: pom.xml — @@ -101,7 +101,8 @@ under the License. <chill.version>0.7.4</chill.version> <asm.version>5.0.4</asm.version> <zookeeper.version>3.4.6</zookeeper.version> <curator.version>2.12.0</curator.version> + <curator.version>2.11.0</curator.version> End diff – Why are you downgrading the curator version?
          Hide
          githubbot ASF GitHub Bot added a comment -

          Github user tillrohrmann commented on a diff in the pull request:

          https://github.com/apache/flink/pull/3881#discussion_r116208844

          — Diff: flink-runtime/src/main/java/org/apache/flink/runtime/zookeeper/ZooKeeperStateHandleStore.java —
          @@ -346,17 +346,20 @@ public int exists(String pathInZooKeeper) throws Exception {
          } else {
          // Initial cVersion (number of changes to the children of this node)
          int initialCVersion = stat.getCversion();
          -

          • List<String> children = ZKPaths.getSortedChildren(
          • client.getZookeeperClient().getZooKeeper(),
          • ZKPaths.fixForNamespace(client.getNamespace(), "/"));
            -
          • for (String path : children) {
          • path = "/" + path;
            + List<String> childrenInStr =
            + client.getZookeeperClient().getZooKeeper().
            + getChildren(ZKPaths.fixForNamespace(client.getNamespace(), "/"), false);
            + List<Long> children = new ArrayList<Long>(childrenInStr.size());
            + for(String childNode : childrenInStr) {
            + children.add(new Long(childNode));
              • End diff –

          I'm not sure whether we can assume that the children paths are always longs. In the general case this is not true (see `ZooKeeperMesosWorkerStore`).

          Show
          githubbot ASF GitHub Bot added a comment - Github user tillrohrmann commented on a diff in the pull request: https://github.com/apache/flink/pull/3881#discussion_r116208844 — Diff: flink-runtime/src/main/java/org/apache/flink/runtime/zookeeper/ZooKeeperStateHandleStore.java — @@ -346,17 +346,20 @@ public int exists(String pathInZooKeeper) throws Exception { } else { // Initial cVersion (number of changes to the children of this node) int initialCVersion = stat.getCversion(); - List<String> children = ZKPaths.getSortedChildren( client.getZookeeperClient().getZooKeeper(), ZKPaths.fixForNamespace(client.getNamespace(), "/")); - for (String path : children) { path = "/" + path; + List<String> childrenInStr = + client.getZookeeperClient().getZooKeeper(). + getChildren(ZKPaths.fixForNamespace(client.getNamespace(), "/"), false); + List<Long> children = new ArrayList<Long>(childrenInStr.size()); + for(String childNode : childrenInStr) { + children.add(new Long(childNode)); End diff – I'm not sure whether we can assume that the children paths are always longs. In the general case this is not true (see `ZooKeeperMesosWorkerStore`).
          Hide
          githubbot ASF GitHub Bot added a comment -

          Github user ramkrish86 commented on a diff in the pull request:

          https://github.com/apache/flink/pull/3881#discussion_r116211032

          — Diff: pom.xml —
          @@ -101,7 +101,8 @@ under the License.
          <chill.version>0.7.4</chill.version>
          <asm.version>5.0.4</asm.version>
          <zookeeper.version>3.4.6</zookeeper.version>

          • <curator.version>2.12.0</curator.version>
            + <curator.version>2.11.0</curator.version>
              • End diff –

          Oh..My environment was not able to get 2.12.0. So to make things compile I included this change. Will revert it.

          Show
          githubbot ASF GitHub Bot added a comment - Github user ramkrish86 commented on a diff in the pull request: https://github.com/apache/flink/pull/3881#discussion_r116211032 — Diff: pom.xml — @@ -101,7 +101,8 @@ under the License. <chill.version>0.7.4</chill.version> <asm.version>5.0.4</asm.version> <zookeeper.version>3.4.6</zookeeper.version> <curator.version>2.12.0</curator.version> + <curator.version>2.11.0</curator.version> End diff – Oh..My environment was not able to get 2.12.0. So to make things compile I included this change. Will revert it.
          Hide
          githubbot ASF GitHub Bot added a comment -

          Github user ramkrish86 commented on a diff in the pull request:

          https://github.com/apache/flink/pull/3881#discussion_r116211132

          — Diff: flink-runtime/src/main/java/org/apache/flink/runtime/zookeeper/ZooKeeperStateHandleStore.java —
          @@ -346,17 +346,20 @@ public int exists(String pathInZooKeeper) throws Exception {
          } else {
          // Initial cVersion (number of changes to the children of this node)
          int initialCVersion = stat.getCversion();
          -

          • List<String> children = ZKPaths.getSortedChildren(
          • client.getZookeeperClient().getZooKeeper(),
          • ZKPaths.fixForNamespace(client.getNamespace(), "/"));
            -
          • for (String path : children) {
          • path = "/" + path;
            + List<String> childrenInStr =
            + client.getZookeeperClient().getZooKeeper().
            + getChildren(ZKPaths.fixForNamespace(client.getNamespace(), "/"), false);
            + List<Long> children = new ArrayList<Long>(childrenInStr.size());
            + for(String childNode : childrenInStr) { + children.add(new Long(childNode)); + }
              • End diff –

          Here again. It is my bad. I lost my previous changes becauseo f the compile issue. So lost this. I have made a new push already for this sort thing.

          Show
          githubbot ASF GitHub Bot added a comment - Github user ramkrish86 commented on a diff in the pull request: https://github.com/apache/flink/pull/3881#discussion_r116211132 — Diff: flink-runtime/src/main/java/org/apache/flink/runtime/zookeeper/ZooKeeperStateHandleStore.java — @@ -346,17 +346,20 @@ public int exists(String pathInZooKeeper) throws Exception { } else { // Initial cVersion (number of changes to the children of this node) int initialCVersion = stat.getCversion(); - List<String> children = ZKPaths.getSortedChildren( client.getZookeeperClient().getZooKeeper(), ZKPaths.fixForNamespace(client.getNamespace(), "/")); - for (String path : children) { path = "/" + path; + List<String> childrenInStr = + client.getZookeeperClient().getZooKeeper(). + getChildren(ZKPaths.fixForNamespace(client.getNamespace(), "/"), false); + List<Long> children = new ArrayList<Long>(childrenInStr.size()); + for(String childNode : childrenInStr) { + children.add(new Long(childNode)); + } End diff – Here again. It is my bad. I lost my previous changes becauseo f the compile issue. So lost this. I have made a new push already for this sort thing.
          Hide
          githubbot ASF GitHub Bot added a comment -

          Github user ramkrish86 commented on a diff in the pull request:

          https://github.com/apache/flink/pull/3881#discussion_r116211254

          — Diff: flink-runtime/src/main/java/org/apache/flink/runtime/zookeeper/ZooKeeperStateHandleStore.java —
          @@ -346,17 +346,20 @@ public int exists(String pathInZooKeeper) throws Exception {
          } else {
          // Initial cVersion (number of changes to the children of this node)
          int initialCVersion = stat.getCversion();
          -

          • List<String> children = ZKPaths.getSortedChildren(
          • client.getZookeeperClient().getZooKeeper(),
          • ZKPaths.fixForNamespace(client.getNamespace(), "/"));
            -
          • for (String path : children) {
          • path = "/" + path;
            + List<String> childrenInStr =
            + client.getZookeeperClient().getZooKeeper().
            + getChildren(ZKPaths.fixForNamespace(client.getNamespace(), "/"), false);
            + List<Long> children = new ArrayList<Long>(childrenInStr.size());
            + for(String childNode : childrenInStr) {
            + children.add(new Long(childNode));
              • End diff –

          Ok. I see. I am not sure on this MesosWorker. Using cxid am not sure if we have an API. If so we can direclty use it. Will be back.

          Show
          githubbot ASF GitHub Bot added a comment - Github user ramkrish86 commented on a diff in the pull request: https://github.com/apache/flink/pull/3881#discussion_r116211254 — Diff: flink-runtime/src/main/java/org/apache/flink/runtime/zookeeper/ZooKeeperStateHandleStore.java — @@ -346,17 +346,20 @@ public int exists(String pathInZooKeeper) throws Exception { } else { // Initial cVersion (number of changes to the children of this node) int initialCVersion = stat.getCversion(); - List<String> children = ZKPaths.getSortedChildren( client.getZookeeperClient().getZooKeeper(), ZKPaths.fixForNamespace(client.getNamespace(), "/")); - for (String path : children) { path = "/" + path; + List<String> childrenInStr = + client.getZookeeperClient().getZooKeeper(). + getChildren(ZKPaths.fixForNamespace(client.getNamespace(), "/"), false); + List<Long> children = new ArrayList<Long>(childrenInStr.size()); + for(String childNode : childrenInStr) { + children.add(new Long(childNode)); End diff – Ok. I see. I am not sure on this MesosWorker. Using cxid am not sure if we have an API. If so we can direclty use it. Will be back.
          Hide
          githubbot ASF GitHub Bot added a comment -

          Github user tillrohrmann commented on the issue:

          https://github.com/apache/flink/pull/3881

          Hi @ramkrish86, I might have found an easy way to solve the problem. Take a look at https://github.com/tillrohrmann/flink/commit/5bd499329d68c6f3236b4e89ba25fdb9acb7e422. If this solves the problem, then I would open a PR with it.

          Show
          githubbot ASF GitHub Bot added a comment - Github user tillrohrmann commented on the issue: https://github.com/apache/flink/pull/3881 Hi @ramkrish86, I might have found an easy way to solve the problem. Take a look at https://github.com/tillrohrmann/flink/commit/5bd499329d68c6f3236b4e89ba25fdb9acb7e422 . If this solves the problem, then I would open a PR with it.
          Hide
          githubbot ASF GitHub Bot added a comment -

          GitHub user tillrohrmann opened a pull request:

          https://github.com/apache/flink/pull/3884

          FLINK-6284 Correct sorting of completed checkpoints in ZooKeeperStateHandleStore

          In order to store completed checkpoints in an increasing order in ZooKeeper,
          the paths for the completed checkpoint is no generated by
          `String.format("/%019d", checkpointId)` instead of `String.format("/%s", checkpointId)`.
          This makes sure that the converted long will always have the same length with
          leading 0s.

          You can merge this pull request into a Git repository by running:

          $ git pull https://github.com/tillrohrmann/flink fixZooKeeperSorting

          Alternatively you can review and apply these changes as the patch at:

          https://github.com/apache/flink/pull/3884.patch

          To close this pull request, make a commit to your master/trunk branch
          with (at least) the following in the commit message:

          This closes #3884


          commit 5bd499329d68c6f3236b4e89ba25fdb9acb7e422
          Author: Till Rohrmann <trohrmann@apache.org>
          Date: 2017-05-12T12:23:37Z

          FLINK-6284 Correct sorting of completed checkpoints in ZooKeeperStateHandleStore

          In order to store completed checkpoints in an increasing order in ZooKeeper,
          the paths for the completed checkpoint is no generated by
          String.format("/%019d", checkpointId) instead of String.format("/%s", checkpointId).
          This makes sure that the converted long will always have the same length with
          leading 0s.


          Show
          githubbot ASF GitHub Bot added a comment - GitHub user tillrohrmann opened a pull request: https://github.com/apache/flink/pull/3884 FLINK-6284 Correct sorting of completed checkpoints in ZooKeeperStateHandleStore In order to store completed checkpoints in an increasing order in ZooKeeper, the paths for the completed checkpoint is no generated by `String.format("/%019d", checkpointId)` instead of `String.format("/%s", checkpointId)`. This makes sure that the converted long will always have the same length with leading 0s. You can merge this pull request into a Git repository by running: $ git pull https://github.com/tillrohrmann/flink fixZooKeeperSorting Alternatively you can review and apply these changes as the patch at: https://github.com/apache/flink/pull/3884.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #3884 commit 5bd499329d68c6f3236b4e89ba25fdb9acb7e422 Author: Till Rohrmann <trohrmann@apache.org> Date: 2017-05-12T12:23:37Z FLINK-6284 Correct sorting of completed checkpoints in ZooKeeperStateHandleStore In order to store completed checkpoints in an increasing order in ZooKeeper, the paths for the completed checkpoint is no generated by String.format("/%019d", checkpointId) instead of String.format("/%s", checkpointId). This makes sure that the converted long will always have the same length with leading 0s.
          Hide
          githubbot ASF GitHub Bot added a comment -

          Github user ramkrish86 commented on the issue:

          https://github.com/apache/flink/pull/3881

          @tillrohrmann
          Thanks for the new PR. I just executed your change with 101, 99 , 100 as the checkpoint order. In this case 100 should be the latest one though the actual ids are not sorted. But with your change and my earlier commit it will always sort 99, 100, 101.
          Can you take a look at my latest commit, that is based on czxid (as per your suggestion) and I think that makes sense. What ever be the actual id, in the zookeeper what was created recently will be the latest checkpoint. But am not very sure if the checkpointId will really be added in a non-sorted way and can 100 be the latest one (though 101 was also there).

          Show
          githubbot ASF GitHub Bot added a comment - Github user ramkrish86 commented on the issue: https://github.com/apache/flink/pull/3881 @tillrohrmann Thanks for the new PR. I just executed your change with 101, 99 , 100 as the checkpoint order. In this case 100 should be the latest one though the actual ids are not sorted. But with your change and my earlier commit it will always sort 99, 100, 101. Can you take a look at my latest commit, that is based on czxid (as per your suggestion) and I think that makes sense. What ever be the actual id, in the zookeeper what was created recently will be the latest checkpoint. But am not very sure if the checkpointId will really be added in a non-sorted way and can 100 be the latest one (though 101 was also there).
          Hide
          githubbot ASF GitHub Bot added a comment -

          Github user ramkrish86 commented on the issue:

          https://github.com/apache/flink/pull/3881

          I won't be available for next 2 to 3 hours. So feel free to decide based on your convenience in case you need to make the RC candidate for 1.3 release. I am sorry that I could not make an initial commit that took care of things properly, should have been more careful. Thanks for the opportunity.

          Show
          githubbot ASF GitHub Bot added a comment - Github user ramkrish86 commented on the issue: https://github.com/apache/flink/pull/3881 I won't be available for next 2 to 3 hours. So feel free to decide based on your convenience in case you need to make the RC candidate for 1.3 release. I am sorry that I could not make an initial commit that took care of things properly, should have been more careful. Thanks for the opportunity.
          Hide
          githubbot ASF GitHub Bot added a comment -

          Github user StefanRRichter commented on the issue:

          https://github.com/apache/flink/pull/3884

          LGTM +1

          Show
          githubbot ASF GitHub Bot added a comment - Github user StefanRRichter commented on the issue: https://github.com/apache/flink/pull/3884 LGTM +1
          Hide
          githubbot ASF GitHub Bot added a comment -

          Github user tillrohrmann commented on the issue:

          https://github.com/apache/flink/pull/3884

          Thanks for the review @StefanRRichter. Merging this PR once Travis gives green light.

          Show
          githubbot ASF GitHub Bot added a comment - Github user tillrohrmann commented on the issue: https://github.com/apache/flink/pull/3884 Thanks for the review @StefanRRichter. Merging this PR once Travis gives green light.
          Hide
          githubbot ASF GitHub Bot added a comment -

          Github user ramkrish86 closed the pull request at:

          https://github.com/apache/flink/pull/3881

          Show
          githubbot ASF GitHub Bot added a comment - Github user ramkrish86 closed the pull request at: https://github.com/apache/flink/pull/3881
          Hide
          githubbot ASF GitHub Bot added a comment -

          Github user asfgit closed the pull request at:

          https://github.com/apache/flink/pull/3884

          Show
          githubbot ASF GitHub Bot added a comment - Github user asfgit closed the pull request at: https://github.com/apache/flink/pull/3884
          Hide
          till.rohrmann Till Rohrmann added a comment -

          1.4.0: ebc13688c9441adf61075707f40b361ee02597f3
          1.3.0: 827d74e69386cff87576972c1b69a16b92b730ae

          Show
          till.rohrmann Till Rohrmann added a comment - 1.4.0: ebc13688c9441adf61075707f40b361ee02597f3 1.3.0: 827d74e69386cff87576972c1b69a16b92b730ae

            People

            • Assignee:
              till.rohrmann Till Rohrmann
              Reporter:
              xiaogang.shi Xiaogang Shi
            • Votes:
              0 Vote for this issue
              Watchers:
              6 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development