|
If you really want to go very robust with the JobTracker, you could implement it by embedding ZooKeeper and using it for all the RPC coordination and state tracking. This is a more natural fit than ZK for the NN, IMO.
A job could be a node in ZK, TaskTrackers would be able to see what work they should do by looking in specific nodes – and they can register their status and health as well. Something like a Reduce job getting the map outputs it needs would simply be listing children from a reduce node for a job. Furthermore, with watches firing off, there would be no ping-response sleep delays which would improve performance, and significantly improve latency for latency sensitive jobs. Good suggestion on ZK with Jobtracker.
Do you know who is implementing ZK for the NameNode? How doese it works? Currently, we employed the database-like replication approach to ship log from primary to standby(multiple standbys). We also plan to support different level synchnization, i.e. tight, loose. For tight mode, we guarantee primary disk write and standby disk write. For loose mode, we guarantee primary disk write and standby receive log. Currently, in my experiment, my approach follows:
1. Hook Hadoop for Memory write and Disk write. 2. Replication log from primary node to slaves(one standy at this stage) 3. Replay log in slaves 4. Standby heartbeat primary and take over if it feels primary down I would like to know other approaches for Hadoop HA on NameNode. What's additional requirements on Hadoop HA? whatever you come up with, can it work on virtualised clusters where clocks are jittery?
I encounter lots of interesting timeouts with VMs, depending on the underlying infrastructure and its load, and its easy for anything that assumes that multicast works or that clocks move forward at the same rate to get confused. At this point Lamport will probably post a comment implying that timeouts are essential, and I agree -things just need to work well in a world where time is jerky, and the virtual HDDs may not be flushed when calling sync().
I don't even know if such a thing is happening, but given that Google's MapReduce paper and BigTable paper indicate that Chubby is leveraged for HA on various hadoop equivalents I figured that was part of the discussion. I am in the process of using ZK for an internal project that has some similarities with the JT. Previously it was using ping-response for both state communication and triggering (distributed) events. Switching to ZK proved to be a bit outside the bounds of its typical use case but is very fast, reliable, and light-weight. State management and persistence is much easier, and communication with watches very reliable and fast. Anyhow, its an option. The reliability, availability, and time sensitivity issues it addresses. Other special aspects of the JT use case would probably drive some ZK features – like an api to return only 'new' children of a node after a specific transaction id, easier out-of-the-box embedding, and some higher level APIs for read-only/watch-only use cases.
steve,
see bookkeeper: scott,
We find that the JobTracker only persist the state of competed and dead jobs in DFS, not including the state of in-process jobs. However, we can use the existing approach by which JT persists the competed and dead jobs to persist the in-process jobs in JT. I feel it is much simpler than ZK-based approach, since we already make NN high availability. To Amr:
We found that Bookkeeper said Hadoop used Write-Ahead logging approach. However, based on my understanding, in 0.19, Hadoop modifies the contents in memory first, and then persists the log. And there is no transaction relationship between modifying memory and persisting log. For example, in the Create operation (Create a new file entry in the namespace) [FSNamesystem.java 998] startFileInternal(src, permissions, holder, clientMachine, overwrite, false, replication, blockSize); // Add a node child to the namespace, and write the log to a buffer. [FSNamesystem.java 1000] getEditLog().logSync(); // persist the log to the disk.
Umm. This is not true. JobTracker persists the state of running jobs as well in the job history files and can recover from this state upon a restart. This was done as part of Jie Qiu> we need to consider how to ship log between primary/standby/slaves
Are you aware of BackupNode introduced in Jie Qiu> Memory write means changes to the BTree, What B-tree did you mean? Jie Qiu> Standby heartbeat primary and take over if it feels primary down How exactly the standby will decide ("feel" in your terms) that the primary is down? Also it would be really good to have a document, which gives a general overview and details of your design. You can refer to other jiras for examples. We used to have design documents for large changes like the one you are attempting here. |
|||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
The issue of HA for HDFS is raised in HDFS-243