Uploaded image for project: 'Giraph'
  1. Giraph
  2. GIRAPH-1139

Resuming from checkpoint doesn't work

    Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 1.2.0
    • Fix Version/s: None
    • Component/s: bsp
    • Labels:
      None

      Description

      I ran into a couple of issues when trying to get Giraph to resume from checkpoints (using mapreduce.max.attempts rather than GiraphJobRetryChecker).

      • If we just wrote a checkpoint, the master expects the workers to checkpoint again, while the workers (correctly) clear the checkpointing flag.
      • When workers restart, they take their task id from the partition number, which stays the same across multiple attempts. This gets transferred to the Netty clientId, and the server starts ignoring messages from restarted workers because it thinks it processed them already.

      I believe I've fixed these issues. I'll send a GitHub PR shortly.

        Issue Links

          Activity

          Hide
          githubbot ASF GitHub Bot added a comment -

          GitHub user neggert opened a pull request:

          https://github.com/apache/giraph/pull/30

          GIRAPH-1139 Fix resuming from checkpoint

          A couple of fixes that get resuming from checkpoint working.

          • Set checkpointStatus to NONE in master when restarting from checkpoint.

          Workers already do this, so the job hangs when restarting from checkpoint
          while the master waits for workers to create checkpoints they're never
          going to create.

          • Set unique task id for each worker attempt

          Previously, a worker would reuse the task id from the prior attempt. This
          gets propagated to the Netty client id, which makes the master think it has
          already processed any requests that come from that client, causing it to
          discard them. This obviously causes problems.

          And also a fix for GIRAPH-1136. We will now checkpoint on superstep 0 if checkpointing is enabled. Let me know if you'd rather I sent a separate PR for this.

          Testing:
          Ran custom Label Propagation implementation with checkpointing on a ~5b node graph. Manually killed workers (by logging in to worker node and running `kill -9 <pid>`. Ensured that Giraph successfully resumed from most recent checkpoint.

          You can merge this pull request into a Git repository by running:

          $ git pull https://github.com/neggert/giraph trunk_resume_fixes

          Alternatively you can review and apply these changes as the patch at:

          https://github.com/apache/giraph/pull/30.patch

          To close this pull request, make a commit to your master/trunk branch
          with (at least) the following in the commit message:

          This closes #30


          commit 6462d48e46d84cd6aa5ecd5817b0d057ce3a6c1f
          Author: NicEggert <nicholas.eggert@target.com>
          Date: 2017-03-23T20:23:03Z

          Set checkpointStatus to NONE in master when restarting from checkpoint.

          Workers already do this, so the job hangs when restarting from checkpoint
          while the master waits for workers to create checkpoints they're never
          going to create.

          commit 3ed8c18a3bc97c910e364bf7d48d50be25df704c
          Author: NicEggert <nicholas.eggert@target.com>
          Date: 2017-03-23T20:26:02Z

          Checkpoint on superstep 0 if checkpointing is enabled

          commit 74bba4573dbb77242d81352f84969b114db1cb71
          Author: NicEggert <nicholas.eggert@target.com>
          Date: 2017-03-23T20:26:47Z

          Set unique task id for each worker attempt

          Previously, a worker would reuse the task id from the prior attempt. This
          gets propagated to the Netty client id, which makes the master think it has
          already processed any requests that come from that client, causing it to
          discard them. This obviously causes problems.


          Show
          githubbot ASF GitHub Bot added a comment - GitHub user neggert opened a pull request: https://github.com/apache/giraph/pull/30 GIRAPH-1139 Fix resuming from checkpoint A couple of fixes that get resuming from checkpoint working. Set checkpointStatus to NONE in master when restarting from checkpoint. Workers already do this, so the job hangs when restarting from checkpoint while the master waits for workers to create checkpoints they're never going to create. Set unique task id for each worker attempt Previously, a worker would reuse the task id from the prior attempt. This gets propagated to the Netty client id, which makes the master think it has already processed any requests that come from that client, causing it to discard them. This obviously causes problems. And also a fix for GIRAPH-1136 . We will now checkpoint on superstep 0 if checkpointing is enabled. Let me know if you'd rather I sent a separate PR for this. Testing: Ran custom Label Propagation implementation with checkpointing on a ~5b node graph. Manually killed workers (by logging in to worker node and running `kill -9 <pid>`. Ensured that Giraph successfully resumed from most recent checkpoint. You can merge this pull request into a Git repository by running: $ git pull https://github.com/neggert/giraph trunk_resume_fixes Alternatively you can review and apply these changes as the patch at: https://github.com/apache/giraph/pull/30.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #30 commit 6462d48e46d84cd6aa5ecd5817b0d057ce3a6c1f Author: NicEggert <nicholas.eggert@target.com> Date: 2017-03-23T20:23:03Z Set checkpointStatus to NONE in master when restarting from checkpoint. Workers already do this, so the job hangs when restarting from checkpoint while the master waits for workers to create checkpoints they're never going to create. commit 3ed8c18a3bc97c910e364bf7d48d50be25df704c Author: NicEggert <nicholas.eggert@target.com> Date: 2017-03-23T20:26:02Z Checkpoint on superstep 0 if checkpointing is enabled commit 74bba4573dbb77242d81352f84969b114db1cb71 Author: NicEggert <nicholas.eggert@target.com> Date: 2017-03-23T20:26:47Z Set unique task id for each worker attempt Previously, a worker would reuse the task id from the prior attempt. This gets propagated to the Netty client id, which makes the master think it has already processed any requests that come from that client, causing it to discard them. This obviously causes problems.
          Hide
          githubbot ASF GitHub Bot added a comment -

          Github user neggert commented on the issue:

          https://github.com/apache/giraph/pull/30

          Could someone please take a look at this? @majakabiljo @edunov

          Show
          githubbot ASF GitHub Bot added a comment - Github user neggert commented on the issue: https://github.com/apache/giraph/pull/30 Could someone please take a look at this? @majakabiljo @edunov
          Hide
          githubbot ASF GitHub Bot added a comment -

          Github user edunov commented on the issue:

          https://github.com/apache/giraph/pull/30

          Hi Nic, thank you for fixing this and sorry for the delay.
          I'll take a look at this diff other the weekend

          Show
          githubbot ASF GitHub Bot added a comment - Github user edunov commented on the issue: https://github.com/apache/giraph/pull/30 Hi Nic, thank you for fixing this and sorry for the delay. I'll take a look at this diff other the weekend
          Hide
          githubbot ASF GitHub Bot added a comment -

          Github user edunov commented on a diff in the pull request:

          https://github.com/apache/giraph/pull/30#discussion_r111767933

          — Diff: giraph-core/src/main/java/org/apache/giraph/master/BspServiceMaster.java —
          @@ -1734,7 +1735,7 @@ private CheckpointStatus getCheckpointStatus(long superstep) {
          if (checkpointFrequency == 0)

          { return CheckpointStatus.NONE; }
          • long firstCheckpoint = INPUT_SUPERSTEP + 1 + checkpointFrequency;
            + long firstCheckpoint = INPUT_SUPERSTEP + 1;
              • End diff –

          What is the reason for changing this? Do you want it to always do checkpoint after the first superstep?

          Show
          githubbot ASF GitHub Bot added a comment - Github user edunov commented on a diff in the pull request: https://github.com/apache/giraph/pull/30#discussion_r111767933 — Diff: giraph-core/src/main/java/org/apache/giraph/master/BspServiceMaster.java — @@ -1734,7 +1735,7 @@ private CheckpointStatus getCheckpointStatus(long superstep) { if (checkpointFrequency == 0) { return CheckpointStatus.NONE; } long firstCheckpoint = INPUT_SUPERSTEP + 1 + checkpointFrequency; + long firstCheckpoint = INPUT_SUPERSTEP + 1; End diff – What is the reason for changing this? Do you want it to always do checkpoint after the first superstep?
          Hide
          githubbot ASF GitHub Bot added a comment -

          Github user neggert commented on the issue:

          https://github.com/apache/giraph/pull/30

          I'm finding that roughly half of my ~2 hour job time is spent just loading the graph. Resuming from checkpoint lets me skip that step if something fails. This is also why I want to make sure that Giraph checkpoints before starting superstep 0. (It used to work this way, and it's documented in the Giraph book.)

          Show
          githubbot ASF GitHub Bot added a comment - Github user neggert commented on the issue: https://github.com/apache/giraph/pull/30 I'm finding that roughly half of my ~2 hour job time is spent just loading the graph. Resuming from checkpoint lets me skip that step if something fails. This is also why I want to make sure that Giraph checkpoints before starting superstep 0. (It used to work this way, and it's documented in the Giraph book.)
          Hide
          githubbot ASF GitHub Bot added a comment -

          Github user neggert commented on a diff in the pull request:

          https://github.com/apache/giraph/pull/30#discussion_r111771953

          — Diff: giraph-core/src/main/java/org/apache/giraph/master/BspServiceMaster.java —
          @@ -1734,7 +1735,7 @@ private CheckpointStatus getCheckpointStatus(long superstep) {
          if (checkpointFrequency == 0)

          { return CheckpointStatus.NONE; }
          • long firstCheckpoint = INPUT_SUPERSTEP + 1 + checkpointFrequency;
            + long firstCheckpoint = INPUT_SUPERSTEP + 1;
              • End diff –

          This will actually checkpoint before running superstep 0. You're basically checkpointing the input loading work.

          Show
          githubbot ASF GitHub Bot added a comment - Github user neggert commented on a diff in the pull request: https://github.com/apache/giraph/pull/30#discussion_r111771953 — Diff: giraph-core/src/main/java/org/apache/giraph/master/BspServiceMaster.java — @@ -1734,7 +1735,7 @@ private CheckpointStatus getCheckpointStatus(long superstep) { if (checkpointFrequency == 0) { return CheckpointStatus.NONE; } long firstCheckpoint = INPUT_SUPERSTEP + 1 + checkpointFrequency; + long firstCheckpoint = INPUT_SUPERSTEP + 1; End diff – This will actually checkpoint before running superstep 0. You're basically checkpointing the input loading work.
          Hide
          githubbot ASF GitHub Bot added a comment -

          Github user majakabiljo commented on the issue:

          https://github.com/apache/giraph/pull/30

          Can we get rid of getHostnamePartitionId() to avoid incorrectly using it in the future? I see various other places where taskPartition is used for identifier, do any of them need to be updated too?

          Show
          githubbot ASF GitHub Bot added a comment - Github user majakabiljo commented on the issue: https://github.com/apache/giraph/pull/30 Can we get rid of getHostnamePartitionId() to avoid incorrectly using it in the future? I see various other places where taskPartition is used for identifier, do any of them need to be updated too?
          Hide
          githubbot ASF GitHub Bot added a comment -

          Github user neggert commented on the issue:

          https://github.com/apache/giraph/pull/30

          There actually is [one instance][1] where I think it's okay to use `getHostnamePartitionId`. This happens in `BspServiceMaster.becomeMaster` when creating a master bid in ZK, before any `TaskInfo` instance is created.

          This does make me realize, though, that I need to make the same change to how the task id is set in `BspServiceMaster`.

          What about just changing how `taskPartition` is set in `BspService`, like so?

          this.taskPartition = (int)getApplicationAttempt() * conf.getMaxWorkers() + getTaskPartition();

          I think this is actually the minimal code change to fix the issue. I don't see anywhere in the code that actually cares about the task partition as anything other than a unique identifier.

          [1]: https://github.com/apache/giraph/blob/trunk/giraph-core/src/main/java/org/apache/giraph/master/BspServiceMaster.java#L805

          Show
          githubbot ASF GitHub Bot added a comment - Github user neggert commented on the issue: https://github.com/apache/giraph/pull/30 There actually is [one instance] [1] where I think it's okay to use `getHostnamePartitionId`. This happens in `BspServiceMaster.becomeMaster` when creating a master bid in ZK, before any `TaskInfo` instance is created. This does make me realize, though, that I need to make the same change to how the task id is set in `BspServiceMaster`. What about just changing how `taskPartition` is set in `BspService`, like so? this.taskPartition = (int)getApplicationAttempt() * conf.getMaxWorkers() + getTaskPartition(); I think this is actually the minimal code change to fix the issue. I don't see anywhere in the code that actually cares about the task partition as anything other than a unique identifier. [1] : https://github.com/apache/giraph/blob/trunk/giraph-core/src/main/java/org/apache/giraph/master/BspServiceMaster.java#L805
          Hide
          githubbot ASF GitHub Bot added a comment -

          Github user neggert commented on the issue:

          https://github.com/apache/giraph/pull/30

          This is ready for another look. I've replaced partition id with task id. The only place that actually needs a partition id is logging missing workers in `BspServiceMaster`. In that case, it's easy enough to recover the partition id by taking the task id modulo the number of workers.

          Show
          githubbot ASF GitHub Bot added a comment - Github user neggert commented on the issue: https://github.com/apache/giraph/pull/30 This is ready for another look. I've replaced partition id with task id. The only place that actually needs a partition id is logging missing workers in `BspServiceMaster`. In that case, it's easy enough to recover the partition id by taking the task id modulo the number of workers.
          Hide
          githubbot ASF GitHub Bot added a comment -

          Github user neggert commented on the issue:

          https://github.com/apache/giraph/pull/30

          ping @edunov @majakabiljo

          Show
          githubbot ASF GitHub Bot added a comment - Github user neggert commented on the issue: https://github.com/apache/giraph/pull/30 ping @edunov @majakabiljo
          Hide
          githubbot ASF GitHub Bot added a comment -

          Github user majakabiljo commented on a diff in the pull request:

          https://github.com/apache/giraph/pull/30#discussion_r119682732

          — Diff: giraph-core/src/main/java/org/apache/giraph/bsp/BspService.java —
          @@ -288,6 +289,9 @@ public BspService(
          throw new RuntimeException(e);
          }

          + this.taskId = (int)getApplicationAttempt() * conf.getMaxWorkers() + conf.getTaskPartition();
          — End diff –

          Please fix checkstyle (make sure 'mvn verify' passes)

          Show
          githubbot ASF GitHub Bot added a comment - Github user majakabiljo commented on a diff in the pull request: https://github.com/apache/giraph/pull/30#discussion_r119682732 — Diff: giraph-core/src/main/java/org/apache/giraph/bsp/BspService.java — @@ -288,6 +289,9 @@ public BspService( throw new RuntimeException(e); } + this.taskId = (int)getApplicationAttempt() * conf.getMaxWorkers() + conf.getTaskPartition(); — End diff – Please fix checkstyle (make sure 'mvn verify' passes)
          Hide
          githubbot ASF GitHub Bot added a comment -

          Github user majakabiljo commented on a diff in the pull request:

          https://github.com/apache/giraph/pull/30#discussion_r119682567

          — Diff: giraph-core/src/main/java/org/apache/giraph/bsp/BspService.java —
          @@ -272,7 +273,7 @@ public BspService(
          }
          if (LOG.isInfoEnabled()) {
          LOG.info("BspService: Connecting to ZooKeeper with job " + jobId +

          • ", " + getTaskPartition() + " on " + serverPortList);
            + ", " + getTaskId() + " on " + serverPortList);
              • End diff –

          We need to set taskId before this

          Show
          githubbot ASF GitHub Bot added a comment - Github user majakabiljo commented on a diff in the pull request: https://github.com/apache/giraph/pull/30#discussion_r119682567 — Diff: giraph-core/src/main/java/org/apache/giraph/bsp/BspService.java — @@ -272,7 +273,7 @@ public BspService( } if (LOG.isInfoEnabled()) { LOG.info("BspService: Connecting to ZooKeeper with job " + jobId + ", " + getTaskPartition() + " on " + serverPortList); + ", " + getTaskId() + " on " + serverPortList); End diff – We need to set taskId before this
          Hide
          githubbot ASF GitHub Bot added a comment -

          Github user neggert commented on a diff in the pull request:

          https://github.com/apache/giraph/pull/30#discussion_r122006839

          — Diff: giraph-core/src/main/java/org/apache/giraph/bsp/BspService.java —
          @@ -272,7 +273,7 @@ public BspService(
          }
          if (LOG.isInfoEnabled()) {
          LOG.info("BspService: Connecting to ZooKeeper with job " + jobId +

          • ", " + getTaskPartition() + " on " + serverPortList);
            + ", " + getTaskId() + " on " + serverPortList);
              • End diff –

          We can't actually set `taskId` before creating the ZK client, since we need ZK to get the get the application attempt. I changed this to log the partition instead, which we can get from `conf`.

          Show
          githubbot ASF GitHub Bot added a comment - Github user neggert commented on a diff in the pull request: https://github.com/apache/giraph/pull/30#discussion_r122006839 — Diff: giraph-core/src/main/java/org/apache/giraph/bsp/BspService.java — @@ -272,7 +273,7 @@ public BspService( } if (LOG.isInfoEnabled()) { LOG.info("BspService: Connecting to ZooKeeper with job " + jobId + ", " + getTaskPartition() + " on " + serverPortList); + ", " + getTaskId() + " on " + serverPortList); End diff – We can't actually set `taskId` before creating the ZK client, since we need ZK to get the get the application attempt. I changed this to log the partition instead, which we can get from `conf`.
          Hide
          githubbot ASF GitHub Bot added a comment -

          Github user neggert commented on the issue:

          https://github.com/apache/giraph/pull/30

          Fixed @majakabiljo's comments. @edunov

          Show
          githubbot ASF GitHub Bot added a comment - Github user neggert commented on the issue: https://github.com/apache/giraph/pull/30 Fixed @majakabiljo's comments. @edunov
          Hide
          githubbot ASF GitHub Bot added a comment -

          Github user neggert commented on the issue:

          https://github.com/apache/giraph/pull/30

          Ping @edunov

          Show
          githubbot ASF GitHub Bot added a comment - Github user neggert commented on the issue: https://github.com/apache/giraph/pull/30 Ping @edunov
          Hide
          githubbot ASF GitHub Bot added a comment -

          Github user edunov commented on the issue:

          https://github.com/apache/giraph/pull/30

          LGTM! Thanks @neggert
          @majakabiljo do you want to land it or should I?

          Show
          githubbot ASF GitHub Bot added a comment - Github user edunov commented on the issue: https://github.com/apache/giraph/pull/30 LGTM! Thanks @neggert @majakabiljo do you want to land it or should I?
          Hide
          hudson Hudson added a comment -

          FAILURE: Integrated in Jenkins build Giraph-trunk-Commit #1708 (See https://builds.apache.org/job/Giraph-trunk-Commit/1708/)
          GIRAPH-1139 (edunov: http://git-wip-us.apache.org/repos/asf?p=giraph.git&a=commit&h=cc489350eba8db8bc62a4fca77b7ad9aa14bcf3d)

          • (edit) giraph-core/src/main/java/org/apache/giraph/bsp/BspService.java
          • (edit) giraph-core/src/main/java/org/apache/giraph/worker/BspServiceWorker.java
          • (edit) giraph-core/src/main/java/org/apache/giraph/master/BspServiceMaster.java
          Show
          hudson Hudson added a comment - FAILURE: Integrated in Jenkins build Giraph-trunk-Commit #1708 (See https://builds.apache.org/job/Giraph-trunk-Commit/1708/ ) GIRAPH-1139 (edunov: http://git-wip-us.apache.org/repos/asf?p=giraph.git&a=commit&h=cc489350eba8db8bc62a4fca77b7ad9aa14bcf3d ) (edit) giraph-core/src/main/java/org/apache/giraph/bsp/BspService.java (edit) giraph-core/src/main/java/org/apache/giraph/worker/BspServiceWorker.java (edit) giraph-core/src/main/java/org/apache/giraph/master/BspServiceMaster.java
          Hide
          githubbot ASF GitHub Bot added a comment -

          Github user asfgit closed the pull request at:

          https://github.com/apache/giraph/pull/30

          Show
          githubbot ASF GitHub Bot added a comment - Github user asfgit closed the pull request at: https://github.com/apache/giraph/pull/30
          Hide
          githubbot ASF GitHub Bot added a comment -

          Github user edunov commented on the issue:

          https://github.com/apache/giraph/pull/30

          I've committed this one. Let me know if there are any issues.

          Show
          githubbot ASF GitHub Bot added a comment - Github user edunov commented on the issue: https://github.com/apache/giraph/pull/30 I've committed this one. Let me know if there are any issues.
          Hide
          nseggert Nic Eggert added a comment -

          Not an expert, but that build failure looks unrelated.

          Show
          nseggert Nic Eggert added a comment - Not an expert, but that build failure looks unrelated.

            People

            • Assignee:
              Unassigned
              Reporter:
              nseggert Nic Eggert
            • Votes:
              0 Vote for this issue
              Watchers:
              4 Start watching this issue

              Dates

              • Created:
                Updated:

                Development