Details

    • Type: New Feature
    • Status: In Progress
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: None
    • Labels: None
    • Environment: High availability enterprise system

Description

      The Hadoop framework has been designed, in an effort to enhance performance, with a single JobTracker (master node). Its responsibilities range from managing the job submission process and computing the input splits to scheduling tasks on the slave nodes (TaskTrackers) and monitoring their health.
      In some environments, such as the IBM and Google Internet-scale computing initiative, there is a need for high availability, and performance becomes a secondary issue. In these environments, having a system with a single point of failure (such as Hadoop's single JobTracker) is a major concern.
      My proposal is to provide a redundant version of Hadoop by adding support for multiple replicated JobTrackers. This design can be approached in many different ways.

      In the document at: http://sites.google.com/site/hadoopthesis/Home/FaultTolerantHadoop.pdf?attredirects=0

      I wrote an overview of the problem and some approaches to solve it.

      I am posting this to the community to gather feedback on the best way to proceed with my work.

      Thank you!

      1. HADOOP-4586v0.3.patch
        39 kB
        Francesco Salbaroli
      2. Enhancing the Hadoop MapReduce framework by adding fault.ppt
        511 kB
        Francesco Salbaroli
      3. jgroups-all.jar
        1.92 MB
        Francesco Salbaroli
      4. HADOOP-4586-0.1.patch
        35 kB
        Francesco Salbaroli
      5. FaultTolerantHadoop.pdf
        136 kB
        Francesco Salbaroli


          Activity

          Francesco Salbaroli added a comment -

          Copy of the document with proposals

          Amar Kamat added a comment -

          Francesco, HADOOP-3245 allows the JobTracker to (re)start. If the history files are maintained on DFS then the JobTracker can start on any other machine. The only issue that prevents us from achieving complete fault tolerance is HADOOP-4016. Once that gets fixed, we can port the JobTracker to any machine and continue from there. Hence the JobTracker is fault tolerant, but the switch is not seamless and porting requires manual intervention.

          Francesco Salbaroli added a comment -

          Thanks for your quick response, Amar. Just a quick question: is it possible now to have multiple instances of the JobTracker running simultaneously on different machines, with transparent failover to a redundant copy?

          This is a problem reported by the IBM HiPODS team working on the academic initiative.

          steve_l added a comment -

          This is an interesting project which will provide much thesis work, especially from the testing and proof of correctness perspectives.

          -There are some implicit assumptions about the ability of the infrastructure to provision hardware, namely that Cold Standby is inappropriate. If a virtual machine can be provisioned and brought up live within a minute, Cold Standby is surprisingly viable, and, on pay-as-you-go infrastructure, cost-effective.

          -The statement that forwarding all state changes to all slaves (Hot Standby) is best needs to be qualified with estimated load values and the impact of the events on the network. Is there a cluster size or map/reduce job lifetime at which the state traffic becomes an issue, or is it just load on the nodes?

          -How do you intend to implement failover without notifying the task trackers? DNS update?

          -I would like to see some coverage of the election protocol, in particular, how to coordinate such an election over an infrastructure which denies multicast IP (e.g. Amazon EC2).

          -Determining the liveness of the JobTracker is going to be hard. Using Lamport's definitions, it is only live if it is capable of performing work within bounded time, so the true way to determine health is to submit work to the system. Early failures (IPC deadlock, host outage, etc.) may be detectable early, but some failure modes may be hard to detect. Some of the ongoing work in HADOOP-3628 can act as a starting point, but it is inadequate if you really want "HA". (A minimal probe sketch follows this comment.)

          -Ignoring HDFS availability, what is going to happen when the farm partitions and both partitions have slaves and a set of task trackers? Who will be in charge?

          -I would have expected some citations for an MSc project; presumably this is an early draft.

          -Take a look at Anubis; this is how we implement partition awareness/HA, though it currently uses Multicast to bootstrap, so will not work on EC2 without adding a new discovery mechanism (simpleDB?)
          http://wiki.smartfrog.org/wiki/display/sf/Anubis
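
          A minimal sketch of an RPC-level liveness probe along these lines, using the org.apache.hadoop.mapred.JobClient API; this is illustrative only (and weaker than the submit-work criterion above), not code from the attached patch:

            import org.apache.hadoop.mapred.ClusterStatus;
            import org.apache.hadoop.mapred.JobClient;
            import org.apache.hadoop.mapred.JobConf;

            public class JobTrackerProbe {
              // Returns true if the JobTracker at host:port answers a status RPC.
              // Catches dead hosts and hung IPC, but not all failure modes.
              public static boolean isAlive(String jobTrackerHostPort) {
                JobConf conf = new JobConf();
                conf.set("mapred.job.tracker", jobTrackerHostPort);
                try {
                  JobClient client = new JobClient(conf);           // opens an RPC proxy to the JT
                  ClusterStatus status = client.getClusterStatus(); // fails if the JT is unreachable
                  client.close();
                  return status != null;
                } catch (Exception e) {
                  return false;
                }
              }
            }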

          Francesco Salbaroli added a comment -

          Thank you for the comment Steve.

          1) You're right, but I don't have exact details about the infrastructure on which the Hadoop cluster will run (probably here at IBM it will be possible to run Hadoop on top of the IBM BlueCloud cloud computing infrastructure). So in an environment with fault detection and fast, automatic VM provisioning, cold standby can be an option. This point needs further investigation and I hope to receive further feedback on it.

          2) For Hot Standby I assumed a very small number of JobTracker replicas (between 2 and 4), so the extra network traffic shouldn't be a major concern. But this can be verified only after extensive test sessions.

          3) Yes, DNS update seems to be the best option.

          4) I wasn't aware of Amazon EC2's multicast limitation. Two possible solutions: statically define the JobTracker nodes, or maintain a list of nodes on DFS or in a shared cache.

          5) I thought about a heartbeat mechanism.

          6) Network partitioning is an issue. Ignoring HDFS, using an election protocol will produce separate, smaller, fully functional clusters (which is unwanted behaviour), but the system should be able to detect multiple running masters and re-run the election to reach another stable state.

          7) This will be part of my M.Sc., but I produced this document only to propose my ideas to the community and gather feedback. And, yes, this is an early draft.

          8) I will take a look at it.

          So, thank you again for the interest you have shown in my work; I hope to hear from you and other community members soon.

          Francesco

          Francesco Salbaroli added a comment -

          What does the community think about using the JGroups reliable multicast system for communication and status monitoring between master and slaves?

          It has two major benefits:
          1) It implements reliable multicast communication.
          2) It abstracts away the underlying protocol (it can exploit multicast UDP where available, or use TCP where multicast is forbidden, e.g. on Amazon EC2).
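
          For illustration only (assuming the JGroups 2.x API; not code from this issue), the group-membership part could look roughly like this: every JobTracker replica joins one channel and reacts to view changes, while the UDP-versus-TCP choice is pushed into the protocol-stack configuration file. The file and cluster names below are assumptions.

            import org.jgroups.JChannel;
            import org.jgroups.ReceiverAdapter;
            import org.jgroups.View;

            public class JobTrackerGroup {
              public static void main(String[] args) throws Exception {
                // The protocol stack (UDP multicast vs. TCP, e.g. for EC2) is selected by
                // the configuration file; the membership logic below does not change.
                JChannel channel = new JChannel("jgroups-config.xml"); // illustrative file name
                channel.setReceiver(new ReceiverAdapter() {
                  @Override
                  public void viewAccepted(View view) {
                    // Called on every membership change; a member missing from the new
                    // view corresponds to a failed or unreachable JobTracker replica.
                    System.out.println("JobTracker group members: " + view.getMembers());
                  }
                });
                channel.connect("hadoop-jobtrackers"); // illustrative cluster name
              }
            }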

          Regards,
          Francesco

          Francesco Salbaroli added a comment -

          To obtain transparent fail-over to a different JobTracker, a dynamic JT resolution mechanism must be provided.
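
          One hypothetical way to do that, matching the HDFS-backed resolution mentioned in the following comment: the active JobTracker publishes its address to a well-known file on HDFS, and TaskTrackers/JobClients read that file when they need to (re)locate the master. The path below is an assumption for illustration, not necessarily the one used in the patch.

            import org.apache.hadoop.conf.Configuration;
            import org.apache.hadoop.fs.FSDataInputStream;
            import org.apache.hadoop.fs.FSDataOutputStream;
            import org.apache.hadoop.fs.FileSystem;
            import org.apache.hadoop.fs.Path;

            public class JobTrackerAddressFile {
              private static final Path ADDRESS_FILE = new Path("/system/jobtracker.address"); // illustrative

              // Called by the JobTracker instance that currently acts as master.
              public static void publish(Configuration conf, String hostPort) throws Exception {
                FileSystem fs = FileSystem.get(conf);
                FSDataOutputStream out = fs.create(ADDRESS_FILE, true); // overwrite previous address
                out.writeUTF(hostPort);
                out.close();
              }

              // Called by TaskTrackers/JobClients to (re)locate the current master.
              public static String resolve(Configuration conf) throws Exception {
                FileSystem fs = FileSystem.get(conf);
                FSDataInputStream in = fs.open(ADDRESS_FILE);
                try {
                  return in.readUTF();
                } finally {
                  in.close();
                }
              }
            }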

          Francesco Salbaroli added a comment -

          I will release a preliminary test version of Fault tolerant Hadoop before 17th Dec.

          Features will include:
          -The JGroups 2.6.7 toolkit for reliable multicast communication, which is based on a highly configurable protocol stack that adapts to different environments (I will post documentation about it).
          -It wraps completely around the Hadoop source code to minimize modifications in the source tree.
          -Dynamic JobTracker address resolution using HDFS as the backing store.

          Enhancements in future versions:
          -A higher level of abstraction
          -Better exception handling

          I'll post the source code at the beginning of next week (hopefully).

          Can I be added to the list of developers?

          Best regards,
          Francesco

          Bernd Fondermann added a comment -

          Francesco,

          about how to contribute, please see

          http://wiki.apache.org/hadoop/HowToContribute

          Have fun,

          Bernd

          Francesco Salbaroli added a comment -

          This is a very preliminary (and tested only locally) release of Fault tolerant Hadoop.

          The Hadoop source tree is only slightly modified, in the org.apache.hadoop.mapred.TaskTracker class.

          The package containing the fault tolerance wrapper is org.apache.hadoop.mapred.faulttolerant.

          The JGroups library (jgroups-all.jar) must be copied into the lib/ folder.

          To run the FT version of Hadoop:
          1) Configure hadoop-site.xml to match the environment
          2) Format the HDFS filesystem ($HADOOP_HOME/bin/hadoop namenode -format)
          3) Run the HDFS daemons (to run locally: $HADOOP_HOME/bin/start-dfs.sh)
          4) Run one or more instances of FTJobTracker ($HADOOP_HOME/bin/hadoop org.apache.hadoop.mapred.faulttolerant.FTJobTracker)
          5) Run one or more instances of FTTaskTracker ($HADOOP_HOME/bin/hadoop org.apache.hadoop.mapred.faulttolerant.FTTaskTracker)

          Regards,
          Francesco Salbaroli

          Francesco Salbaroli added a comment -

          These are the charts from a very introductory presentation I held at the IBM Innovation Centre to give an update on the progress of my work.

          steve_l added a comment -

          OpenOffice says the PPT file is password encrypted. Is that right? Could you upload a PDF?

          Otis Gospodnetic added a comment -

          Steve, just open it read-only, it works that way.

          Nigel Daley added a comment -

          Francesco, given this is a new feature, it can be integrated into the next major release (which is 0.21). It can't be integrated into a sustaining release.

          Bo Shi added a comment -

          > In my opinion, it is better to avoid using active copies due to the high
          > complexity of the coordination protocol and, instead, using a master-slave
          > model with soft-state shared between copies through a distributed cache
          > mechanism or saved on HDFS.

          Please forgive me if I'm being naive here (I see that I'm a bit late to the show), but wouldn't using Zookeeper to persist jobtracker state effectively mask this complexity?

          Has anyone explored refactoring the job tracker to use Zookeeper instead of engineering a new master/slave replication system?

          Suresh Srinivas added a comment -

          Sorry for the late comments:
          For a master/slave HA solution, the two main problems are:
          1. A mechanism that determines the master in a cluster during startup and failover, handling loss of quorum, split-brain, and fencing in case of split-brain. It also requires comprehensive management tools for configuring, managing and monitoring the cluster.
          2. Sharing state information between master and slave, so that a slave node can take over as master.

          Currently the proposed solution addresses mainly the second problem. I have not seen much information on how the first problem is addressed. While the sharing of information between master and slave can be done in many ways, managing the master/slave cluster is a more complicated problem. Could you please add more information on how the design handles these issues, and some notes on how an administrator would use this functionality to manage the cluster?

          Also, an analysis of the impact of this feature on JobTracker performance needs to be done.

          > Has anyone explored refactoring the job tracker to use Zookeeper instead of engineering a new master/slave replication system?
          Storing the JobTracker state in ZooKeeper may not be a viable option, given that ZooKeeper is intended for storing small amounts of data (KBs) and the JobTracker has a lot more data than that to persist.

          dhruba borthakur added a comment -

          Zookeeper might not work well for maintaining JobTracker state (or for that matter, Namenode persistent state) because these processes have lots of metadata to store.

          Francesco Salbaroli added a comment - edited

          >Sorry for the late comments:
          >For a master/slave HA solution, two main problems are:
          >1. Mechanism that determines a master in a cluster during startup and failover.

          The JGroups library (whose manual can be found here: http://www.jgroups.org/javagroupsnew/docs/manual/pdf/manual.pdf ) automatically handles the election of a group coordinator. The node elected group coordinator is also the master of the cluster. In case of a failure, a new group coordinator (and, consequently, a new cluster master) will be elected.
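
          As an illustration of that rule (assuming the JGroups 2.x API; not code from the patch), a replica can decide locally whether it is the coordinator, and therefore the master, by checking whether it is the first member of the current view:

            import org.jgroups.Address;
            import org.jgroups.JChannel;
            import org.jgroups.View;

            public class MasterCheck {
              // In JGroups the coordinator is the first member of the view; here it is
              // treated as the cluster master, as described above.
              public static boolean isMaster(JChannel channel) {
                View view = channel.getView();
                Address coordinator = view.getMembers().get(0);
                return coordinator.equals(channel.getLocalAddress());
              }
            }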

          >Handling loss of quorum,

          The shared state resides entirely on HDFS (see issues HADOOP-1876 and HADOOP-3245), so, for now, there is no shared soft state between nodes. However, the facilities for managing a shared state are present and can be used in a future update.

          >split-brain and fencing in case of split-brain.

          The JGroups library tries to automatically handle network partitions and merging, but given that:

          • There is no shared soft-state
          • There is only one access point in the whole Hadoop cluster to the HDFS (the NameNode)

          the network partition problem should not be an issue (only one partition at a time can access the HDFS). In future versions a more elegant way of dealing with network partitions should be added.

          > It also requires comprehensive management tools for configuring, managing and monitoring the cluster.

          I am now adding JMX support. After the initial testing phase I will post results and an updated version.

          >2. Sharing state information between master and slave, so that a slave node can take over as master.
          >Currently the proposed solution addresses mainly the second problem. I have not seen much information on how the first problem is addressed. While the sharing of information between master and slave can be done in many ways, managing the master/slave cluster is a more complicated problem. Could you please add more information on how the design handles these issues and some notes on how an administrator uses this functionality to manage the cluster.

          I hope I have given an answer to your question. If you need more, feel free to contact me.

          >Also analysis of the impact of job tracker performance due to the introduction of this feature needs to be done.

          I am about to begin the testing phase; results will follow.

          Regards,
          Francesco

          dhruba borthakur added a comment -

          >the network partition problem should not be an issue (only one partition at a time can access the HDFS).

          I think the problem still exists. Suppose a network partition occurs between a master and the remainder of the nodes. A new master is elected all right, and the new master will now own the metadata state. The new master will start to update JobTracker metadata stored in HDFS because it thinks it is the sole owner of this metadata. At this time, how can you be guaranteed that the old master is not continuing to update the same metadata?

          Doug Cutting added a comment -

          > Zookeeper might not work well for maintaining JobTracker state (or for that matter, Namenode persistent state) because these processes have lots of metadata to store.

          That's the key concern. ZooKeeper's in-memory data structures would probably take much more space than those in the namenode and/or jobtracker do today. Other than that, ZooKeeper seems ideally suited to these tasks. Perhaps if ZooKeeper were to support namespace partitioning and rebalancing (hard problems) then it could be used to store such data. It would certainly vastly simplify many things.

          Francesco Salbaroli added a comment -

          Some bugfixes in release 0.3 to work in a distributed environment.

          This version has been tested in a distributed VMware Infrastructure 3 environment (using RHEL 5.2 as a guest OS).

          Sharad Agarwal added a comment -

          If I understand correctly, the current patch doesn't share state between master and slaves. It relies on HADOOP-3245 for keeping the state. I assume that for this to work the state has to be kept on HDFS instead of the local filesystem. In case a new master is elected, the JobTracker is started using the state from HDFS, right?
          Also, reading the master info from HDFS at frequent intervals from each node may not scale well. I think ZooKeeper would be better suited in the case where we are just doing master election and keeping a watch on master changes.
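
          For illustration of that watch-based alternative (standard ZooKeeper client API; the znode path and timeout are assumptions, and this is not code from the patch): each node reads the master address from a znode and is notified on change, instead of polling HDFS.

            import org.apache.zookeeper.WatchedEvent;
            import org.apache.zookeeper.Watcher;
            import org.apache.zookeeper.ZooKeeper;

            public class MasterAddressWatcher implements Watcher {
              private static final String MASTER_ZNODE = "/hadoop/jobtracker/master"; // illustrative path
              private final ZooKeeper zk;

              public MasterAddressWatcher(String zkQuorum) throws Exception {
                // Session events and znode watches are both delivered to process().
                zk = new ZooKeeper(zkQuorum, 30000, this);
              }

              // Reads the current master address and leaves a watch for the next change,
              // so nodes are notified instead of polling a file on HDFS.
              public String currentMaster() throws Exception {
                byte[] data = zk.getData(MASTER_ZNODE, this, null);
                return new String(data, "UTF-8");
              }

              @Override
              public void process(WatchedEvent event) {
                if (event.getType() == Watcher.Event.EventType.NodeDataChanged) {
                  // The active JobTracker changed: re-read the address and reconnect.
                }
              }
            }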

          Bo Shi added a comment -

          Hi,

          I am in China until June 22nd and will only have intermittent access
          to email until then.

          Thanks,
          Bo


          Bo Shi
          207-469-8264

          Hari A V added a comment -

          Hi,

          In my team, we have also been analysing how to provide HA for the JobTracker. Our approach is quite similar to Francesco's.

          The complete HA solution can be divided into three aspects:

          1. Sharing of job related state between Master and Slave job trackers

          This can be achieved with issues HADOOP-1876 and HADOOP-3245.

          2. Failure Detection and Master Election

          We prefer ZooKeeper for this. We had quite a bad experience with JGroups in some of our previous projects, including deadlocks, network traffic overhead, etc. (maybe the latest version of JGroups is more stable); we were forced to replace JGroups. ZooKeeper is the best solution available for leader election. We have seen ZooKeeper used very successfully in similar situations in the "Katta" project and also in some of our internal projects. (A minimal election sketch follows this list.)

          3. How to notify JobClients and TaskTrackers about the new master on failure
          One option would be DNS, as mentioned.
          Another option is providing a list of JobTracker IPs to JobClients and TaskTrackers; they can silently retry all available IPs in case of failure. On the server side, slave JobTrackers will not accept any service request. This way we can avoid split-brain and network partition scenarios, and the ZooKeeper cluster inherently avoids split-brain issues during leader election.
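
          A minimal sketch of the ZooKeeper election mentioned in point 2 (standard ZooKeeper client API; the znode paths and data format are assumptions, and the parent election znode is assumed to exist already):

            import java.util.Collections;
            import java.util.List;
            import org.apache.zookeeper.CreateMode;
            import org.apache.zookeeper.ZooDefs;
            import org.apache.zookeeper.ZooKeeper;

            public class JobTrackerElection {
              private static final String ELECTION_PATH = "/hadoop/jobtracker/election"; // illustrative

              // Returns true if this replica won the election and should act as master.
              public static boolean runForMaster(ZooKeeper zk, String hostPort) throws Exception {
                // Each candidate registers an ephemeral, sequential znode; it disappears
                // automatically when the candidate's session dies, triggering re-election.
                String me = zk.create(ELECTION_PATH + "/candidate-", hostPort.getBytes("UTF-8"),
                    ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL_SEQUENTIAL);

                List<String> candidates = zk.getChildren(ELECTION_PATH, false);
                Collections.sort(candidates);
                // The candidate holding the lowest sequence number is the master; the
                // others should place a watch on their predecessor and wait.
                return me.endsWith(candidates.get(0));
              }
            }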

          We have not yet started our work. Please provide your valuable opinions.

          thanks
          Hari

          Leitao Guo added a comment -

          We also have a solution for JobTracker HA based on LVS:

          (1) Start a JobTracker attached to a virtual IP. All TaskTrackers and JobClients connect to the JT via the virtual IP;
          (2) Two LVS servers (for hot standby) monitor the state of the JobTracker;
          (3) When the JobTracker goes down, LVS triggers a script to start another JobTracker on another server. Job information will be recovered via HADOOP-1876 and HADOOP-3245.

          This solution needs no changes to the JobTracker, but the deployment is a little complicated.

          Leitao Guo added a comment -

          @Hari, your solution is just the same as HBase master HA; I think it works!

          I think I can contribute to this issue. Should I start a new JIRA about ZK+JT, or reassign this issue to me?

          Arun C Murthy added a comment -

          Leitao and Hari - apologies for coming in late. I've missed this so far.

          HADOOP-1876 and HADOOP-3245 have had too many issues in the past and we have since moved away from this model - in fact we never deployed either at any reasonable scale due to the issues we have seen with them. Also, we have actually removed a lot of this code in future versions of Hadoop, since it didn't work well at all and complicated the JobTracker to a very large extent.

          OTOH, we have been working on a completely revamped architecture for Hadoop Map-Reduce via MAPREDUCE-279. You guys might be interested... we would also love your feedback there, based on your experiences. Thanks!

          Leitao Guo added a comment -

          Thanks for your response, Arun.

          Although HADOOP-1876 and HADOOP-3245 do not work well, I think failover for the JobTracker is still worth considering.

          If the JobTracker on one server goes down, we need to restart the JobTracker or migrate it to another server. In our scenario, we may not care whether a job continues from the exact progress it had before the JobTracker failed, but automatic failover is needed. I think integrating ZooKeeper with the JobTracker is a workable solution for failover.

          Wang Xin added a comment -

          I would like to know the status of this issue. I want to get the HA feature in 0.21, but I would like to know how to integrate it with ZK. Can someone tell me about the status of JobTrackers with JGroups or other approaches?

          Hari A V added a comment -

          hi,

          Sorry for a very late response.
          @Arun: Yes, MAPREDUCE-279 is a completely new architecture, but we may still need to wait a long time for it to be done. For those who use the 0.20 version and need a simple "availability solution", a much simpler approach would be helpful.
          @Leitao: Yes, it's similar to HMaster HA. It works. I have finished the development of a ZK-based framework and integrated it with the JT. I am in the process of contributing it back. As a first step, I have opened a JIRA in ZooKeeper for a generic LeaderElectionService (ZOOKEEPER-1080). I will upload the patch soon.

          ZK+JT may not be a full-fledged HA solution, but what it tries to address is:
          1. Avoiding manual intervention during a JobTracker failure.
          2. Recovering and continuing the jobs (even re-submitting them) without notifying the clients who submitted them.

          The solution remains very simple, as there is no need to synchronize the "state of the jobs".

          Cons
          -------
          Jobs may take longer to finish during failover due to re-submission.

          Please provide suggestions

          -Hari

          amol kamble added a comment -

          Is this issue resolved or not? I want to work on this... Please help me.


            People

            • Assignee: Francesco Salbaroli
            • Reporter: Francesco Salbaroli
            • Votes: 3
            • Watchers: 51
