Issue Details (XML | Word | Printable)

Key: MAPREDUCE-737
Type: New Feature New Feature
Status: Open Open
Priority: Major Major
Assignee: Unassigned
Reporter: Jie Qiu
Votes: 1
Watchers: 32
Operations

If you were logged in you would be able to see more operations.
Hadoop Map/Reduce

High Availability support for Hadoop

Created: 30/Jun/09 10:05 AM   Updated: 08/Jul/09 04:35 PM
Return to search
Component/s: None
Affects Version/s: None
Fix Version/s: None

Time Tracking:
Not Specified

Issue Links:
Reference
 

Tags: HA, High Availability


 Description  « Hide
Currently, We look at the HA of Hadoop cluster. We need to consider the NameNode HA as well as Jobtracker HA. For NameNode, we want to build primary/standy or master-slaves pattern to provide NameNode HA. Therefore, we need to consider how to ship log between primary/standby/slaves and how commit "write" operation to NameNode after the agreement among primary/standby/slaves on log. Whether will we use Linux HA package or NameNode-built-in HA package without the help of outter Linux HA package.
After NameNode become high availability, is it necessary to provide HA for Jobtracker? Can Jobtracker persist the states of Jobs and tasks into HA NameNode? Or Jobtracker also needs the same approach from NameNode for HA support.

 All   Comments   Work Log   Change History   Subversion Commits      Sort Order: Ascending order - Click to sort in descending order
Steve Loughran added a comment - 01/Jul/09 01:53 PM
The JobTracker can persist its state, but for HA you'd need to do failover in the clients- currently to achieve that you need to switch the replacement to the same hostname/IP address of the original.

The issue of HA for HDFS is raised in HDFS-243


Scott Carey added a comment - 01/Jul/09 06:05 PM
If you really want to go very robust with the JobTracker, you could implement it by embedding ZooKeeper and using it for all the RPC coordination and state tracking. This is a more natural fit than ZK for the NN, IMO.

A job could be a node in ZK, TaskTrackers would be able to see what work they should do by looking in specific nodes – and they can register their status and health as well. Something like a Reduce job getting the map outputs it needs would simply be listing children from a reduce node for a job.
Fault tolerance would come for free, and this form of RPC would likely be more efficient than the current one.

Furthermore, with watches firing off, there would be no ping-response sleep delays which would improve performance, and significantly improve latency for latency sensitive jobs.


Jie Qiu added a comment - 02/Jul/09 06:03 AM
Good suggestion on ZK with Jobtracker.

Do you know who is implementing ZK for the NameNode? How doese it works? Currently, we employed the database-like replication approach to ship log from primary to standby(multiple standbys). We also plan to support different level synchnization, i.e. tight, loose.

For tight mode, we guarantee primary disk write and standby disk write. For loose mode, we guarantee primary disk write and standby receive log.


Jie Qiu added a comment - 02/Jul/09 07:42 AM
Currently, in my experiment, my approach follows:

1. Hook Hadoop for Memory write and Disk write.
Memory write means changes to the BTree, the corresponding log operations (OP_ADD, OP_RENAME, OP_DELETE,....)
Disk write means write the log to disks

2. Replication log from primary node to slaves(one standy at this stage)

3. Replay log in slaves

4. Standby heartbeat primary and take over if it feels primary down

I would like to know other approaches for Hadoop HA on NameNode. What's additional requirements on Hadoop HA?


Steve Loughran added a comment - 02/Jul/09 05:09 PM
whatever you come up with, can it work on virtualised clusters where clocks are jittery?

I encounter lots of interesting timeouts with VMs, depending on the underlying infrastructure and its load, and its easy for anything that assumes that multicast works or that clocks move forward at the same rate to get confused. At this point Lamport will probably post a comment implying that timeouts are essential, and I agree -things just need to work well in a world where time is jerky, and the virtual HDDs may not be flushed when calling sync().


Scott Carey added a comment - 02/Jul/09 05:36 PM

Do you know who is implementing ZK for the NameNode?

I don't even know if such a thing is happening, but given that Google's MapReduce paper and BigTable paper indicate that Chubby is leveraged for HA on various hadoop equivalents I figured that was part of the discussion.

I am in the process of using ZK for an internal project that has some similarities with the JT. Previously it was using ping-response for both state communication and triggering (distributed) events. Switching to ZK proved to be a bit outside the bounds of its typical use case but is very fast, reliable, and light-weight. State management and persistence is much easier, and communication with watches very reliable and fast.

Anyhow, its an option. The reliability, availability, and time sensitivity issues it addresses. Other special aspects of the JT use case would probably drive some ZK features – like an api to return only 'new' children of a node after a specific transaction id, easier out-of-the-box embedding, and some higher level APIs for read-only/watch-only use cases.


Mahadev konar added a comment - 02/Jul/09 05:51 PM

whatever you come up with, can it work on virtualised clusters where clocks are jittery? I encounter lots of interesting timeouts with VMs, depending on the underlying infrastructure and its load, and its easy for anything that assumes that multicast works or that clocks move forward at the same rate to get confused. At this point Lamport will probably post a comment implying that timeouts are essential, and I agree -things just need to work well in a world where time is jerky, and the virtual HDDs may not be flushed when calling sync().

steve,
We have seen ZooKeeper being used on EC2 in virtualized environment. As far I can say, it works for those folks.


Amr Awadallah added a comment - 03/Jul/09 01:27 AM

Do you know who is implementing ZK for the NameNode?

see bookkeeper:

https://issues.apache.org/jira/browse/ZOOKEEPER-276


Bo Dong added a comment - 03/Jul/09 08:26 AM
scott,
We find that the JobTracker only persist the state of competed and dead jobs in DFS, not including the state of in-process jobs.
However, we can use the existing approach by which JT persists the competed and dead jobs to persist the in-process jobs in JT.
I feel it is much simpler than ZK-based approach, since we already make NN high availability.

Bo Dong added a comment - 03/Jul/09 08:55 AM
To Amr:
We found that Bookkeeper said Hadoop used Write-Ahead logging approach.
However, based on my understanding, in 0.19, Hadoop modifies the contents in memory first, and then persists the log. And there is no transaction relationship between modifying memory and persisting log.
For example, in the Create operation (Create a new file entry in the namespace)
[FSNamesystem.java 998] startFileInternal(src, permissions, holder, clientMachine, overwrite, false, replication, blockSize);
// Add a node child to the namespace, and write the log to a buffer.
[FSNamesystem.java 1000] getEditLog().logSync();
// persist the log to the disk.

Hemanth Yamijala added a comment - 03/Jul/09 09:26 AM

We find that the JobTracker only persist the state of competed and dead jobs in DFS, not including the state of in-process jobs.

Umm. This is not true. JobTracker persists the state of running jobs as well in the job history files and can recover from this state upon a restart. This was done as part of HADOOP-3245.


Konstantin Shvachko added a comment - 07/Jul/09 12:31 AM
Jie Qiu> we need to consider how to ship log between primary/standby/slaves

Are you aware of BackupNode introduced in HADOOP-4539?
I thought the log shipping part has been solved by that. Or did you mean anything else?

Jie Qiu> Memory write means changes to the BTree,

What B-tree did you mean?

Jie Qiu> Standby heartbeat primary and take over if it feels primary down

How exactly the standby will decide ("feel" in your terms) that the primary is down?

Also it would be really good to have a document, which gives a general overview and details of your design. You can refer to other jiras for examples. We used to have design documents for large changes like the one you are attempting here.