[GIRAPH-972] Race condition in checkpointing - ASF JIRA

Details

Type: Bug
Status: Resolved
Priority: Major
Resolution: Fixed
Affects Version/s: None
Fix Version/s: 1.2.0
Component/s: None
Labels: None

Description

A couple of issues were noticed when checkpointing large jobs:
1) Checkpointing appears to depend on the task ID of the master. In most cases it is 0, but sometimes it is not, and since we cannot control it, checkpointing should not depend on it.

2) A race condition occurs on the master when a worker dies:
org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = NoNode for /_hadoopBsp/job_201411061513.38895_0001/_applicationAttemptsDir/0/_superstepDir/9/_workerHealthyDir/hadoop4921.prn2.facebook.com_3
at org.apache.zookeeper.KeeperException.create(KeeperException.java:111)
at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
at org.apache.zookeeper.ZooKeeper.getData(ZooKeeper.java:1151)
at org.apache.zookeeper.ZooKeeper.getData(ZooKeeper.java:1180)
at org.apache.giraph.zk.ZooKeeperExt.getData(ZooKeeperExt.java:470)
at org.apache.giraph.utils.WritableUtils.readFieldsFromZnode(WritableUtils.java:126)
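
The trace above is consistent with the master reading a worker's znode after that worker has died and the znode has been removed. One way to tolerate this is to treat a missing node as "worker gone" rather than as a fatal error. The following is only a minimal sketch of that idea and not the fix committed for this issue: `FakeZk`, `NoNodeException`, `readIfPresent`, and `liveWorkers` are hypothetical stand-ins written for illustration, with `FakeZk` being an in-memory substitute for a ZooKeeper client so that the example runs without a cluster.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Optional;
import java.util.concurrent.ConcurrentHashMap;

public class TolerantZnodeRead {
  /** Hypothetical stand-in for KeeperException.NoNodeException. */
  static class NoNodeException extends Exception {
    NoNodeException(String path) {
      super("NoNode for " + path);
    }
  }

  /** Hypothetical in-memory stand-in for a ZooKeeper client. */
  static class FakeZk {
    private final Map<String, byte[]> znodes = new ConcurrentHashMap<>();

    void create(String path, byte[] data) {
      znodes.put(path, data);
    }

    void delete(String path) {
      znodes.remove(path);
    }

    /** Throws if the node is absent, as a real getData call would. */
    byte[] getData(String path) throws NoNodeException {
      byte[] data = znodes.get(path);
      if (data == null) {
        throw new NoNodeException(path);
      }
      return data;
    }
  }

  /** Reads a znode, mapping "node vanished" to an empty result. */
  static Optional<byte[]> readIfPresent(FakeZk zk, String path) {
    try {
      return Optional.of(zk.getData(path));
    } catch (NoNodeException e) {
      // The worker died between listing and reading; not a master failure.
      return Optional.empty();
    }
  }

  /** Keeps only the previously listed paths that can still be read. */
  static List<String> liveWorkers(FakeZk zk, List<String> listedPaths) {
    List<String> live = new ArrayList<>();
    for (String path : listedPaths) {
      if (readIfPresent(zk, path).isPresent()) {
        live.add(path);
      }
    }
    return live;
  }
}
```

In this sketch, a worker path that was listed but deleted before the read is simply skipped, so the caller can decide separately how to react to the lost worker.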

People

Assignee: Unassigned

Reporter: Sergey Edunov

Votes: 0

Watchers: 2

Dates

Created: 18/Dec/14 18:49

Updated: 14/Oct/16 00:58

Resolved: 18/Dec/14 23:15