  1. Flink
  2. FLINK-11997

ConcurrentModificationException: ZooKeeper unexpectedly modified

Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Duplicate
    • Affects Version/s: 1.8.0
    • None
    • None

    Description

      I am trying to rescale a job running in a Kubernetes (k8s) job cluster with:

      flink modify 00000000000000000000000000000000 -p 2 -m localhost:30081

      Rescaling works when HA is off. Taking a savepoint and restarting from it also works, even with HA turned on. However, rescaling via flink modify with HA on always fails with the following stack trace:

      Caused by: org.apache.flink.util.FlinkException: Failed to rescale the job 00000000000000000000000000000000.
              ... 21 more
      Caused by: java.util.concurrent.CompletionException: org.apache.flink.runtime.jobmaster.exceptions.JobModificationException: Could not restore from temporary rescaling savepoint. This might indicate that the savepoint s3://state/savepoints/savepoint-000000-2fa7fd5dabb2 got corrupted. Deleting this savepoint as a precaution.
              at org.apache.flink.runtime.jobmaster.JobMaster.lambda$rescaleOperators$4(JobMaster.java:470)
              at java.util.concurrent.CompletableFuture.uniHandle(CompletableFuture.java:822)
              at java.util.concurrent.CompletableFuture$UniHandle.tryFire(CompletableFuture.java:797)
              ... 18 more
      Caused by: org.apache.flink.runtime.jobmaster.exceptions.JobModificationException: Could not restore from temporary rescaling savepoint. This might indicate that the savepoint s3://state/savepoints/savepoint-000000-2fa7fd5dabb2 got corrupted. Deleting this savepoint as a precaution.
              at org.apache.flink.runtime.jobmaster.JobMaster.lambda$restoreExecutionGraphFromRescalingSavepoint$18(JobMaster.java:1433)
              at java.util.concurrent.CompletableFuture.uniApply(CompletableFuture.java:602)
              at java.util.concurrent.CompletableFuture$UniApply.tryFire(CompletableFuture.java:577)
              at java.util.concurrent.CompletableFuture$Completion.run(CompletableFuture.java:442)
              at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
              at java.util.concurrent.FutureTask.run(FutureTask.java:266)
              at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
              at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
              at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
              at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
              at java.lang.Thread.run(Thread.java:748)
      Caused by: java.util.ConcurrentModificationException: ZooKeeper unexpectedly modified
              at org.apache.flink.runtime.zookeeper.ZooKeeperStateHandleStore.addAndLock(ZooKeeperStateHandleStore.java:159)
              at org.apache.flink.runtime.checkpoint.ZooKeeperCompletedCheckpointStore.addCheckpoint(ZooKeeperCompletedCheckpointStore.java:216)
              at org.apache.flink.runtime.checkpoint.CheckpointCoordinator.restoreSavepoint(CheckpointCoordinator.java:1106)
              at org.apache.flink.runtime.jobmaster.JobMaster.tryRestoreExecutionGraphFromSavepoint(JobMaster.java:1251)
              at org.apache.flink.runtime.jobmaster.JobMaster.lambda$restoreExecutionGraphFromRescalingSavepoint$18(JobMaster.java:1413)
              ... 10 more
      Caused by: org.apache.flink.shaded.zookeeper.org.apache.zookeeper.KeeperException$NodeExistsException: KeeperErrorCode = NodeExists
              at org.apache.flink.shaded.zookeeper.org.apache.zookeeper.KeeperException.create(KeeperException.java:119)
              at org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper.multiInternal(ZooKeeper.java:1006)
              at org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper.multi(ZooKeeper.java:910)
              at org.apache.flink.shaded.curator.org.apache.curator.framework.imps.CuratorTransactionImpl.doOperation(CuratorTransactionImpl.java:159)
              at org.apache.flink.shaded.curator.org.apache.curator.framework.imps.CuratorTransactionImpl.access$200(CuratorTransactionImpl.java:44)
              at org.apache.flink.shaded.curator.org.apache.curator.framework.imps.CuratorTransactionImpl$2.call(CuratorTransactionImpl.java:129)
              at org.apache.flink.shaded.curator.org.apache.curator.framework.imps.CuratorTransactionImpl$2.call(CuratorTransactionImpl.java:125)
              at org.apache.flink.shaded.curator.org.apache.curator.RetryLoop.callWithRetry(RetryLoop.java:109)
              at org.apache.flink.shaded.curator.org.apache.curator.framework.imps.CuratorTransactionImpl.commit(CuratorTransactionImpl.java:122)
              at org.apache.flink.runtime.zookeeper.ZooKeeperStateHandleStore.addAndLock(ZooKeeperStateHandleStore.java:153)
              ... 14 more
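
The innermost cause in the trace is a NodeExists error raised while addAndLock commits a ZooKeeper transaction, which addAndLock then surfaces as a ConcurrentModificationException. In other words, a node the transaction tried to create was already present. The trace does not show which node that was or why it existed. The sketch below is only a hypothetical in-memory model of that create-if-absent behaviour, to make the failure mode concrete. InMemoryHandleStore, RescaleSketch, and the path string are invented for illustration and are not Flink or ZooKeeper APIs.

```java
import java.util.ConcurrentModificationException;
import java.util.HashMap;
import java.util.Map;

/**
 * Hypothetical in-memory stand-in for a ZooKeeper-backed handle store.
 * This is NOT Flink's ZooKeeperStateHandleStore; it only mimics
 * "create the node, fail if the path already exists".
 */
final class InMemoryHandleStore {
    private final Map<String, String> nodes = new HashMap<>();

    /** Creates the node at the given path, or fails if the path is already taken. */
    void addAndLock(String path, String stateHandle) {
        if (nodes.putIfAbsent(path, stateHandle) != null) {
            // Same wrapping as in the trace: a "node exists" cause inside
            // a ConcurrentModificationException.
            throw new ConcurrentModificationException(
                "ZooKeeper unexpectedly modified",
                new IllegalStateException("KeeperErrorCode = NodeExists for " + path));
        }
    }

    boolean contains(String path) {
        return nodes.containsKey(path);
    }
}

final class RescaleSketch {
    /** Returns true if a second add for the same path is rejected. */
    static boolean secondAddIsRejected() {
        InMemoryHandleStore store = new InMemoryHandleStore();
        String path = "/checkpoints/example-job/1"; // hypothetical path, not taken from the report

        // First writer registers a handle under the path.
        store.addAndLock(path, "handle-written-earlier");
        try {
            // Second writer tries to register a handle under the same path.
            store.addAndLock(path, "handle-from-rescaling-savepoint");
            return false;
        } catch (ConcurrentModificationException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        System.out.println("second add rejected: " + secondAddIsRejected());
    }
}
```

If this model matches what happens in the cluster, the question for the HA case is which earlier write leaves the node behind before the rescaling savepoint is restored. That part is an assumption to verify against the attached log, not something the trace proves.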

      Attachments

        1. FAILURE
          326 kB
          David Anderson

            People

              Assignee: Unassigned
              Reporter: David Anderson (alpinegizmo)