GIRAPH-972 (project: Giraph, retired)

Race condition in checkpointing

Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 1.2.0
    • Component/s: None
    • Labels: None

    Description

      A couple of issues were noticed when checkpointing large jobs:
      1) The task ID of the master appears to matter. In most cases it is 0, but sometimes it is not, and since we cannot control it, checkpointing should not depend on it.

      2) A race condition occurs on the master when a worker dies:
      org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = NoNode for /_hadoopBsp/job_201411061513.38895_0001/_applicationAttemptsDir/0/_superstepDir/9/_workerHealthyDir/hadoop4921.prn2.facebook.com_3
      at org.apache.zookeeper.KeeperException.create(KeeperException.java:111)
      at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
      at org.apache.zookeeper.ZooKeeper.getData(ZooKeeper.java:1151)
      at org.apache.zookeeper.ZooKeeper.getData(ZooKeeper.java:1180)
      at org.apache.giraph.zk.ZooKeeperExt.getData(ZooKeeperExt.java:470)
      at org.apache.giraph.utils.WritableUtils.readFieldsFromZnode(WritableUtils.java:126)
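
      The trace above is consistent with a list-then-read race: the master obtains the set of healthy-worker nodes, a worker dies, and the later read of that worker's node fails with NoNode. The following is a minimal sketch of one way to tolerate this, assuming a vanished node can simply be treated as a dead worker. It is not the patch applied for this issue, and `NodeStore` and `readSurvivors` are hypothetical names backed by an in-memory map, not ZooKeeper or Giraph APIs.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.NoSuchElementException;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical stand-in for a coordination store; not the ZooKeeper or Giraph API.
class HealthyWorkerScan {
    /** In-memory store whose read() throws if the node no longer exists. */
    static class NodeStore {
        private final Map<String, String> nodes = new ConcurrentHashMap<>();

        void put(String path, String data) { nodes.put(path, data); }
        void delete(String path) { nodes.remove(path); }
        List<String> list() { return new ArrayList<>(nodes.keySet()); }

        String read(String path) {
            String data = nodes.get(path);
            if (data == null) {
                throw new NoSuchElementException("NoNode for " + path);
            }
            return data;
        }
    }

    /** Reads every listed node, skipping nodes deleted between the listing and the read. */
    static List<String> readSurvivors(NodeStore store, List<String> listed) {
        List<String> result = new ArrayList<>();
        for (String path : listed) {
            try {
                result.add(store.read(path));
            } catch (NoSuchElementException e) {
                // Node vanished after it was listed: assume the worker died and move on.
            }
        }
        return result;
    }

    public static void main(String[] args) {
        NodeStore store = new NodeStore();
        store.put("worker_1", "healthy-1");
        store.put("worker_3", "healthy-3");
        List<String> listed = store.list(); // snapshot of the children
        store.delete("worker_3");           // worker dies after the listing
        System.out.println(readSurvivors(store, listed));
    }
}
```

      With a real ZooKeeper client, the analogous choice would be to catch `KeeperException.NoNodeException` around the `getData` call shown in the trace, rather than letting it propagate and fail the master.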

          People

            Assignee: Unassigned
            Reporter: Sergey Edunov (edunov)