Details

    • Type: New Feature
    • Status: In Progress
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: None
    • Labels: None
    • Environment: High availability enterprise system

Description

      The Hadoop framework has been designed, in an effort to enhance performance, with a single JobTracker (master node). Its responsibilities range from managing the job submission process and computing the input splits to scheduling tasks on the slave nodes (TaskTrackers) and monitoring their health.
      In some environments, such as the IBM and Google Internet-scale computing initiative, there is a need for high availability, and performance becomes a secondary issue. In these environments, having a system with a single point of failure (such as Hadoop's single JobTracker) is a major concern.
      My proposal is to provide a redundant version of Hadoop by adding support for multiple replicated JobTrackers. This design can be approached in many different ways.

      In the document at: http://sites.google.com/site/hadoopthesis/Home/FaultTolerantHadoop.pdf?attredirects=0

      I wrote an overview of the problem and some approaches to solve it.

      I am posting this to the community to gather feedback on the best way to proceed with my work.

      Thank you!

      1. HADOOP-4586v0.3.patch
        39 kB
        Francesco Salbaroli
      2. Enhancing the Hadoop MapReduce framework by adding fault.ppt
        511 kB
        Francesco Salbaroli
      3. jgroups-all.jar
        1.92 MB
        Francesco Salbaroli
      4. HADOOP-4586-0.1.patch
        35 kB
        Francesco Salbaroli
      5. FaultTolerantHadoop.pdf
        136 kB
        Francesco Salbaroli


          Activity

          Francesco Salbaroli added a comment -

          Copy of the document with proposals

          Amar Kamat added a comment -

          Francesco, HADOOP-3245 allows the JobTracker to (re)start. If the history files are maintained on DFS then the JobTracker can start on any other machine. The only issue that prevents us from achieving complete fault tolerance is HADOOP-4016. Once that gets fixed, we can port the JobTracker to any machine and continue from there. Hence the JobTracker is fault tolerant, but the switch is not seamless and porting requires manual intervention.

          Francesco Salbaroli added a comment -

          Thanks for your quick response, Amar. Just a quick question: is it possible now to have multiple instances of the JobTracker running simultaneously on different machines, with transparent failover to a redundant copy?

          This is a problem reported by the IBM HiPODS team working on the academic initiative.

          steve_l added a comment -

          This is an interesting project which will provide much thesis work, especially from the testing and proof of correctness perspectives.

          -There are some implicit assumptions about the ability of the infrastructure to provision hardware, namely that Cold Standby is inappropriate. If a virtual machine can be provisioned and brought up live within a minute, Cold Standby is surprisingly viable, and, on pay-as-you-go infrastructure, cost-effective.

          -The statement that forwarding all state changes to all slaves (Hot Standby) is best needs to be qualified with estimated load values and the impact of the events on the network. Is there a cluster size or map/reduce job lifetime at which the state traffic becomes an issue, or is it just load on the nodes?

          -How do you intend to implement failover without notifying the task trackers? DNS update?

          -I would like to see some coverage of the election protocol, in particular, how to coordinate such an election over an infrastructure which denies multicast IP (e.g. Amazon EC2).

          -Determining the liveness of the JobTracker is going to be hard. Using Lamport's definitions, it is only live if it is capable of performing work within bounded time, so the true way to determine health is to submit work to the system. Early failures (IPC deadlock, host outage, etc.) may be detectable early, but some failure modes may be hard to detect. Some of the ongoing work in HADOOP-3628 can act as a starting point, but it is inadequate if you really want "HA". (A minimal probe sketch follows this comment.)

          -Ignoring HDFS availability, what is going to happen when the farm partitions and both partitions have slaves and a set of task trackers? Who will be in charge?

          -I would have expected some citations for an MSc project; presumably this is an early draft.

          -Take a look at Anubis; this is how we implement partition awareness/HA, though it currently uses Multicast to bootstrap, so will not work on EC2 without adding a new discovery mechanism (simpleDB?)
          http://wiki.smartfrog.org/wiki/display/sf/Anubis
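
          A minimal sketch of an RPC-level liveness probe along these lines, using the org.apache.hadoop.mapred.JobClient API; this is illustrative only (and weaker than the submit-work criterion above), not code from the attached patch:

            import org.apache.hadoop.mapred.ClusterStatus;
            import org.apache.hadoop.mapred.JobClient;
            import org.apache.hadoop.mapred.JobConf;

            public class JobTrackerProbe {
              // Returns true if the JobTracker at host:port answers a status RPC.
              // Catches dead hosts and hung IPC, but not all failure modes.
              public static boolean isAlive(String jobTrackerHostPort) {
                JobConf conf = new JobConf();
                conf.set("mapred.job.tracker", jobTrackerHostPort);
                try {
                  JobClient client = new JobClient(conf);           // opens an RPC proxy to the JT
                  ClusterStatus status = client.getClusterStatus(); // fails if the JT is unreachable
                  client.close();
                  return status != null;
                } catch (Exception e) {
                  return false;
                }
              }
            }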

          Francesco Salbaroli added a comment -

          Thank you for the comment Steve.

          1) You're right, but I don't have exact details about the infrastructure on which the Hadoop cluster will run (probably here at IBM it will be possible to run Hadoop on top of the IBM BlueCloud cloud computing infrastructure). So in an environment with fault detection and fast, automatic VM provisioning, cold standby can be an option. This point needs further investigation and I hope to receive further feedback on it.

          2) For Hot Standby I assumed a very small number of JobTracker replicas (between 2 and 4), so the extra network traffic shouldn't be a major concern. But this can be verified only after extensive test sessions.

          3) Yes, DNS update seems to be the best option.

          4) I wasn't aware of Amazon EC2's multicast limitation. Two possible solutions: statically define the JobTracker nodes, or maintain a list of nodes on DFS or in a shared cache.

          5) I thought about a heartbeat mechanism.

          6) Network partitioning is an issue. Ignoring HDFS, using an election protocol will produce separate, smaller, fully functional clusters (which is unwanted behaviour), but the system should be able to detect multiple running masters and re-run the election to reach another stable state.

          7) This will be part of my M.Sc., but I produced this document only to propose my ideas to the community and gather feedback. And, yes, this is an early draft.

          8) I will take a look at it.

          So, thank you again for the interest you have shown in my work; I hope to hear from you and other community members soon.

          Francesco

          Francesco Salbaroli added a comment -

          What does the community think about using the JGroups reliable multicast system for communication and status monitoring between master and slaves?

          It has two major benefits:
          1) It implements reliable multicast communication.
          2) It abstracts away the underlying protocol (it can exploit multicast UDP where available, or use TCP where multicast is forbidden, e.g. on Amazon EC2).
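
          For illustration only (assuming the JGroups 2.x API; not code from this issue), the group-membership part could look roughly like this: every JobTracker replica joins one channel and reacts to view changes, while the UDP-versus-TCP choice is pushed into the protocol-stack configuration file. The file and cluster names below are assumptions.

            import org.jgroups.JChannel;
            import org.jgroups.ReceiverAdapter;
            import org.jgroups.View;

            public class JobTrackerGroup {
              public static void main(String[] args) throws Exception {
                // The protocol stack (UDP multicast vs. TCP, e.g. for EC2) is selected by
                // the configuration file; the membership logic below does not change.
                JChannel channel = new JChannel("jgroups-config.xml"); // illustrative file name
                channel.setReceiver(new ReceiverAdapter() {
                  @Override
                  public void viewAccepted(View view) {
                    // Called on every membership change; a member missing from the new
                    // view corresponds to a failed or unreachable JobTracker replica.
                    System.out.println("JobTracker group members: " + view.getMembers());
                  }
                });
                channel.connect("hadoop-jobtrackers"); // illustrative cluster name
              }
            }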

          Regards,
          Francesco

          Francesco Salbaroli added a comment -

          To obtain transparent fail-over to a different JobTracker, a dynamic JT resolution mechanism must be provided.
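
          One hypothetical way to do that, matching the HDFS-backed resolution mentioned in the following comment: the active JobTracker publishes its address to a well-known file on HDFS, and TaskTrackers/JobClients read that file when they need to (re)locate the master. The path below is an assumption for illustration, not necessarily the one used in the patch.

            import org.apache.hadoop.conf.Configuration;
            import org.apache.hadoop.fs.FSDataInputStream;
            import org.apache.hadoop.fs.FSDataOutputStream;
            import org.apache.hadoop.fs.FileSystem;
            import org.apache.hadoop.fs.Path;

            public class JobTrackerAddressFile {
              private static final Path ADDRESS_FILE = new Path("/system/jobtracker.address"); // illustrative

              // Called by the JobTracker instance that currently acts as master.
              public static void publish(Configuration conf, String hostPort) throws Exception {
                FileSystem fs = FileSystem.get(conf);
                FSDataOutputStream out = fs.create(ADDRESS_FILE, true); // overwrite previous address
                out.writeUTF(hostPort);
                out.close();
              }

              // Called by TaskTrackers/JobClients to (re)locate the current master.
              public static String resolve(Configuration conf) throws Exception {
                FileSystem fs = FileSystem.get(conf);
                FSDataInputStream in = fs.open(ADDRESS_FILE);
                try {
                  return in.readUTF();
                } finally {
                  in.close();
                }
              }
            }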

          Francesco Salbaroli added a comment -

          I will release a preliminary test version of Fault tolerant Hadoop before 17th Dec.

          Features will include:
          -The JGroups 2.6.7 toolkit for reliable multicast communication, which is based on a highly configurable protocol stack that adapts to different environments (I will post documentation about it).
          -It wraps completely around the Hadoop source code to minimize modifications in the source tree.
          -Dynamic JobTracker address resolution using HDFS as the backing store.

          Enhancements in future versions:
          -A higher level of abstraction
          -Better exception handling

          I'll post the source code at the beginning of next week (hopefully).

          Can I be added to the list of developers?

          Best regards,
          Francesco

          Bernd Fondermann added a comment -

          Francesco,

          about how to contribute, please see

          http://wiki.apache.org/hadoop/HowToContribute

          Have fun,

          Bernd

          Francesco Salbaroli added a comment -

          This is a very preliminary (and tested only locally) release of Fault tolerant Hadoop.

          The Hadoop source tree is only slightly modified, in the org.apache.hadoop.mapred.TaskTracker class.

          The package containing the fault tolerance wrapper is org.apache.hadoop.mapred.faulttolerant.

          The JGroups library (jgroups-all.jar) must be copied into the lib/ folder.

          To run the FT version of Hadoop:
          1) Configure hadoop-site.xml to match the environment
          2) Format the HDFS filesystem ($HADOOP_HOME/bin/hadoop namenode -format)
          3) Run the HDFS daemons (to run locally: $HADOOP_HOME/bin/start-dfs.sh)
          4) Run one or more instances of FTJobTracker ($HADOOP_HOME/bin/hadoop org.apache.hadoop.mapred.faulttolerant.FTJobTracker)
          5) Run one or more instances of FTTaskTracker ($HADOOP_HOME/bin/hadoop org.apache.hadoop.mapred.faulttolerant.FTTaskTracker)

          Regards,
          Francesco Salbaroli

          Francesco Salbaroli added a comment -

          These are the charts from a very introductory presentation I held at the IBM Innovation Centre to give an update on the progress of my work.

          steve_l added a comment -

          OpenOffice says the PPT file is password encrypted. Is that right? Could you upload a PDF?

          Otis Gospodnetic added a comment -

          Steve, just open it read-only, it works that way.

          Nigel Daley added a comment -

          Francesco, given this is a new feature, it can be integrated into the next major release (which is 0.21). It can't be integrated into a sustaining release.

          Bo Shi added a comment -

          > In my opinion, it is better to avoid using active copies due to the high
          > complexity of the coordination protocol and, instead, using a master-slave
          > model with soft-state shared between copies through a distributed cache
          > mechanism or saved on HDFS.

          Please forgive me if I'm being naive here (I see that I'm a bit late to the show), but wouldn't using Zookeeper to persist jobtracker state effectively mask this complexity?

          Has anyone explored refactoring the job tracker to use Zookeeper instead of engineering a new master/slave replication system?

          Suresh Srinivas added a comment -

          Sorry for the late comments:
          For a master/slave HA solution, the two main problems are:
          1. A mechanism that determines the master in a cluster during startup and failover, handling loss of quorum, split-brain, and fencing in case of split-brain. It also requires comprehensive management tools for configuring, managing and monitoring the cluster.
          2. Sharing state information between master and slave, so that a slave node can take over as master.

          Currently the proposed solution addresses mainly the second problem. I have not seen much information on how the first problem is addressed. While the sharing of information between master and slave can be done in many ways, managing the master/slave cluster is a more complicated problem. Could you please add more information on how the design handles these issues, and some notes on how an administrator would use this functionality to manage the cluster?

          Also, an analysis of the impact of this feature on JobTracker performance needs to be done.

          > Has anyone explored refactoring the job tracker to use Zookeeper instead of engineering a new master/slave replication system?
          Storing the JobTracker state in ZooKeeper may not be a viable option, given that ZooKeeper is intended for storing small amounts of data (KBs) and the JobTracker has a lot more data than that to persist.

          dhruba borthakur added a comment -

          Zookeeper might not work well for maintaining JobTracker state (or for that matter, Namenode persistent state) because these processes have lots of metadata to store.

          Francesco Salbaroli added a comment - edited

          >Sorry for the late comments:
          >For a master/slave HA solution, two main problems are:
          >1. Mechanism that determines a master in a cluster during startup and failover.

          The JGroups library (whose manual can be found here: http://www.jgroups.org/javagroupsnew/docs/manual/pdf/manual.pdf ) automatically handles the election of a group coordinator. The node elected group coordinator is also the master of the cluster. In case of a failure, a new group coordinator (and, consequently, a new cluster master) will be elected.
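
          As an illustration of that rule (assuming the JGroups 2.x API; not code from the patch), a replica can decide locally whether it is the coordinator, and therefore the master, by checking whether it is the first member of the current view:

            import org.jgroups.Address;
            import org.jgroups.JChannel;
            import org.jgroups.View;

            public class MasterCheck {
              // In JGroups the coordinator is the first member of the view; here it is
              // treated as the cluster master, as described above.
              public static boolean isMaster(JChannel channel) {
                View view = channel.getView();
                Address coordinator = view.getMembers().get(0);
                return coordinator.equals(channel.getLocalAddress());
              }
            }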

          >Handling loss of quorum,

          The shared state resides entirely on HDFS (see issues HADOOP-1876 and HADOOP-3245), so, for now, there is no shared soft state between nodes. However, the facilities for managing a shared state are present and can be used in a future update.

          >split-brain and fencing in case of split-brain.

          The JGroups library tries to automatically handle network partitions and merging, but given that:

          • There is no shared soft-state
          • There is only one access point in the whole Hadoop cluster to the HDFS (the NameNode)

          the network partition problem should not be an issue (only one partition at a time can access the HDFS). In future versions a more elegant way of dealing with network partitions should be added.

          > It also requires comprehensive management tools for configuring, managing and monitoring the cluster.

          I am now adding JMX support. After the initial testing phase I will post results and an updated version.

          >2. Sharing state information between master and slave, so that a slave node can take over as master.
          >Currently the proposed solution addresses mainly the second problem. I have not seen much information on how the first problem is addressed. While the sharing of information between master and slave can be done in many ways, managing the master/slave cluster is a more complicated problem. Could you please add more information on how the design handles these issues and some notes on how an administrator uses this functionality to manage the cluster.

          I hope I have given an answer to your question. If you need more, feel free to contact me.

          >Also analysis of the impact of job tracker performance due to the introduction of this feature needs to be done.

          I am about to begin the testing phase; results will follow.

          Regards,
          Francesco

          dhruba borthakur added a comment -

          >the network partition problem should not be an issue (only one partition at a time can access the HDFS).

          I think the problem still exists. Suppose a network partition occurs between a master and the remainder of the nodes. A new master is elected all right, and the new master will now own the metadata state. The new master will start to update JobTracker metadata stored in HDFS because it thinks it is the sole owner of this metadata. At this time, how can you be guaranteed that the old master is not continuing to update the same metadata?

          Doug Cutting added a comment -

          > Zookeeper might not work well for maintaining JobTracker state (or for that matter, Namenode persistent state) because these processes have lots of metadata to store.

          That's the key concern. ZooKeeper's in-memory data structures would probably take much more space than those in the namenode and/or jobtracker do today. Other than that, ZooKeeper seems ideally suited to these tasks. Perhaps if ZooKeeper were to support namespace partitioning and rebalancing (hard problems) then it could be used to store such data. It would certainly vastly simplify many things.

          Francesco Salbaroli added a comment -

          Some bugfixes in release 0.3 to work in a distributed environment.

          This version has been tested in a distributed VMware Infrastructure 3 environment (using RHEL 5.2 as a guest OS).

          Sharad Agarwal added a comment -

          If I understand correctly, the current patch doesn't share state between master and slaves. It relies on HADOOP-3245 for keeping the state. I assume that for this to work the state has to be kept on HDFS instead of the local filesystem. In case a new master is elected, the JobTracker is started using the state from HDFS, right?
          Also, reading the master info from HDFS at frequent intervals from each node may not scale well. I think ZooKeeper would be better suited in the case where we are just doing master election and keeping a watch on master changes.
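
          For illustration of that watch-based alternative (standard ZooKeeper client API; the znode path and timeout are assumptions, and this is not code from the patch): each node reads the master address from a znode and is notified on change, instead of polling HDFS.

            import org.apache.zookeeper.WatchedEvent;
            import org.apache.zookeeper.Watcher;
            import org.apache.zookeeper.ZooKeeper;

            public class MasterAddressWatcher implements Watcher {
              private static final String MASTER_ZNODE = "/hadoop/jobtracker/master"; // illustrative path
              private final ZooKeeper zk;

              public MasterAddressWatcher(String zkQuorum) throws Exception {
                // Session events and znode watches are both delivered to process().
                zk = new ZooKeeper(zkQuorum, 30000, this);
              }

              // Reads the current master address and leaves a watch for the next change,
              // so nodes are notified instead of polling a file on HDFS.
              public String currentMaster() throws Exception {
                byte[] data = zk.getData(MASTER_ZNODE, this, null);
                return new String(data, "UTF-8");
              }

              @Override
              public void process(WatchedEvent event) {
                if (event.getType() == Watcher.Event.EventType.NodeDataChanged) {
                  // The active JobTracker changed: re-read the address and reconnect.
                }
              }
            }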

          Bo Shi added a comment -

          Hi,

          I am in China until June 22nd and will only have intermittent access
          to email until then.

          Thanks,
          Bo


          Bo Shi
          207-469-8264

          Hari A V added a comment -

          Hi,

          In my team, we have also been analysing how to provide HA for the JobTracker. Our approach is quite similar to Francesco's.

          The complete HA solution can be divided into three aspects:

          1. Sharing of job related state between Master and Slave job trackers

          This can be achieved with issues HADOOP-1876 and HADOOP-3245.

          2. Failure Detection and Master Election

          We prefer ZooKeeper for this. We had quite a bad experience with JGroups in some of our previous projects, including deadlocks, network traffic overhead, etc. (maybe the latest version of JGroups is more stable); we were forced to replace JGroups. ZooKeeper is the best solution available for leader election. We have seen ZooKeeper used very successfully in similar situations in the "Katta" project and also in some of our internal projects. (A minimal election sketch follows this list.)

          3. How to notify JobClients and TaskTrackers about the new master on failure
          One option would be DNS, as mentioned.
          Another option is providing a list of JobTracker IPs to JobClients and TaskTrackers; they can silently retry all available IPs in case of failure. On the server side, slave JobTrackers will not accept any service request. This way we can avoid split-brain and network partition scenarios, and the ZooKeeper cluster inherently avoids split-brain issues during leader election.
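
          A minimal sketch of the ZooKeeper election mentioned in point 2 (standard ZooKeeper client API; the znode paths and data format are assumptions, and the parent election znode is assumed to exist already):

            import java.util.Collections;
            import java.util.List;
            import org.apache.zookeeper.CreateMode;
            import org.apache.zookeeper.ZooDefs;
            import org.apache.zookeeper.ZooKeeper;

            public class JobTrackerElection {
              private static final String ELECTION_PATH = "/hadoop/jobtracker/election"; // illustrative

              // Returns true if this replica won the election and should act as master.
              public static boolean runForMaster(ZooKeeper zk, String hostPort) throws Exception {
                // Each candidate registers an ephemeral, sequential znode; it disappears
                // automatically when the candidate's session dies, triggering re-election.
                String me = zk.create(ELECTION_PATH + "/candidate-", hostPort.getBytes("UTF-8"),
                    ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL_SEQUENTIAL);

                List<String> candidates = zk.getChildren(ELECTION_PATH, false);
                Collections.sort(candidates);
                // The candidate holding the lowest sequence number is the master; the
                // others should place a watch on their predecessor and wait.
                return me.endsWith(candidates.get(0));
              }
            }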

          We have not yet started our work. Please provide your valuable opinions.

          thanks
          Hari

          Leitao Guo added a comment -

          We also have a solution for JobTracker HA based on LVS:

          (1) Start a JobTracker attached to a virtual IP. All TaskTrackers and JobClients connect to the JT via the virtual IP;
          (2) Two LVS servers (for hot standby) monitor the state of the JobTracker;
          (3) When the JobTracker goes down, LVS triggers a script to start another JobTracker on another server. Job information will be recovered via HADOOP-1876 and HADOOP-3245.

          This solution needs no changes to the JobTracker, but the deployment is a little complicated.

          Leitao Guo added a comment -

          @Hari, your solution is just the same as HBase master HA; I think it works!

          I think I can contribute to this issue. Should I start a new JIRA about ZK+JT, or reassign this issue to me?

          Arun C Murthy added a comment -

          Leitao and Hari - apologies for coming in late. I've missed this so far.

          HADOOP-1876 and HADOOP-3245 have had too many issues in the past and we have since moved away from this model - in fact we never deployed either at any reasonable scale due to the issues we have seen with them. Also, we have actually removed a lot of this code in future versions of Hadoop, since it didn't work well at all and complicated the JobTracker to a very large extent.

          OTOH, we have been working on a completely revamped architecture for Hadoop Map-Reduce via MAPREDUCE-279. You guys might be interested... we would also love your feedback there, based on your experiences. Thanks!

          Leitao Guo added a comment -

          Thanks for your response, Arun.

          Although HADOOP-1876 and HADOOP-3245 do not work well, I think failover for the JobTracker is still worth considering.

          If the JobTracker on one server goes down, we need to restart the JobTracker or migrate it to another server. In our scenario, we may not care whether a job continues from the exact progress it had before the JobTracker failed, but automatic failover is needed. I think integrating ZooKeeper with the JobTracker is a workable solution for failover.

          Wang Xin added a comment -

          I would like to know the status of this issue. I want to get the HA feature in 0.21, but I would like to know how to integrate it with ZK. Can someone tell me about the status of JobTrackers with JGroups or other approaches?

          Hari A V added a comment -

          hi,

          Sorry for a very late response.
          @Arun: Yes, MAPREDUCE-279 is a completely new architecture, but we may still need to wait a long time for it to be done. For those who use the 0.20 version and need a simple "availability solution", a much simpler approach would be helpful.
          @Leitao: Yes, it's similar to HMaster HA. It works. I have finished the development of a ZK-based framework and integrated it with the JT. I am in the process of contributing it back. As a first step, I have opened a JIRA in ZooKeeper for a generic LeaderElectionService (ZOOKEEPER-1080). I will upload the patch soon.

          ZK+JT may not be a full-fledged HA solution, but what it tries to address is:
          1. Avoiding manual intervention during a JobTracker failure.
          2. Recovering and continuing the jobs (even re-submitting them) without notifying the clients who submitted them.

          The solution remains very simple, as there is no need to synchronize the "state of the jobs".

          Cons
          -------
          Jobs may take longer to finish during failover due to re-submission.

          Please provide suggestions

          -Hari

          amol kamble added a comment -

          Is this issue resolved or not? I want to work on this... Please help me.


            People

            • Assignee: Francesco Salbaroli
            • Reporter: Francesco Salbaroli
            • Votes: 3
            • Watchers: 51
