Details

    • Type: New Feature
    • Status: Resolved
    • Priority: Major
    • Resolution: Later
    • Affects Version/s: 0.23.0
    • Fix Version/s: None
    • Component/s: mrv2
    • Labels: None
    • Environment: All Unix environments

      Description

      MPI is commonly used for many machine-learning applications. OpenMPI (http://www.open-mpi.org/) is a popular BSD-licensed implementation of MPI. In the past, running MPI applications on a Hadoop cluster was achieved using Hadoop Streaming (http://videolectures.net/nipsworkshops2010_ye_gbd/), but it was kludgy. Now that the resource manager has been separated from the JobTracker in Hadoop, we have all the tools needed to make MPI a first-class citizen on a Hadoop cluster. I am currently working on a patch to make MPI an application master. An initial version of this patch will be available soon (hopefully before September 10). This JIRA will track the development of Hamster: the application master for MPI.

        Activity

        Madhurima added a comment -

        Hi,

        I am interested in Hamster and want to build and test the sources.
        Please let me know if the sources of Hamster are available for build and testing.

        thanks,
        Madhurima

        Madhurima added a comment -

        Hi,

        I am interested in the project Hamster. Please let me know if the sources are available for build and test.

        Hui Li added a comment -

        Hi, Zhuoluo (Clark) Yang

        Now I can run MPI programs with mpich2-yarn by following the updated README.md. I am interested in this project and look forward to hearing from you again.

        Thank you very much.

        Zhuoluo (Clark) Yang added a comment -

        Hi, Hui Li

        I have updated the README.md on https://github.com/clarkyzl/mpich2-yarn; will that help?

        Thank you.

        Zhuoluo (Clark) Yang added a comment -

        Hi Arun C Murthy

        Sorry for the late response; I was on vacation recently.

        1. The current stage is that we store almost all of our data on the HDFS cluster, and all of our MPI jobs need that data, so we are trying to run MPI on it (moving the MPI computation to where the data is stored). There is a 100-node cluster running MPICH2 on YARN, and we run simple PLDA/LA applications on it.
        2. I am not an expert on licensing issues, but I think the Apache license is great and friendly to commercial usage. Should I note that in the README.md?
        3. We are really interested in contributing the project to the ASF. The design is simple: make each NodeManager run an "smpd" daemon, and make the application master run "mpiexec" to submit MPI jobs. The biggest problem here, I think, is that the YARN scheduler does not support a gang-scheduling algorithm, and the scheduling is very, very slow. If we want to run MPI fast, we really need collaborators to help us modify the scheduler.

        Thank you very much.
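        The gang-scheduling bottleneck described in point 3 can be illustrated with a back-of-the-envelope model. This is only a sketch: the grant rate and heartbeat period below are hypothetical numbers, not measurements from any cluster.

        ```python
        # Sketch: why the lack of gang scheduling slows MPI startup on YARN.
        # An MPI job cannot call mpiexec until ALL of its containers are
        # granted, so start latency is bounded by the time to receive the
        # LAST container. The grant rate and heartbeat period are assumed
        # values for illustration only.

        def mpi_start_latency(n_containers, grants_per_heartbeat=1, heartbeat_s=1.0):
            """Seconds until the slowest (last) container is granted."""
            heartbeats_needed = -(-n_containers // grants_per_heartbeat)  # ceil division
            return heartbeats_needed * heartbeat_s

        # A 100-process job at one grant per 1 s heartbeat waits ~100 s before
        # mpiexec can even start; a MapReduce job under the same allocation
        # rate would already be running its first tasks.
        print(mpi_start_latency(100))                          # 100.0
        print(mpi_start_latency(100, grants_per_heartbeat=10)) # 10.0
        ```

        The model makes the point that without gang scheduling, MPI start latency is tied to the slowest allocation, which is why the comment above asks for scheduler changes.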

        Hui Li added a comment -

        Hi Clark,

        I have recently been trying to run your mpich2-yarn jar file with a simple MPI routine, but without success.
        Could you please send a short README on how to run the mpich2-yarn program?

        Regards,
        Hui

        Arun C Murthy added a comment -

        Zhuoluo (Clark) Yang Nice!

        I've had a number of people ask me about MPI and YARN... this is very exciting, thanks!

        Could you please share some more info?

        1. What stage is this in? How much usage do you see of this at Taobao?
        2. The github page for the project doesn't list the license, could you please clarify?
        3. Would you be interested in contributing this to the ASF? I'm sure you'll see a lot of collaborators here...

        Thanks again!

        Zhuoluo (Clark) Yang added a comment -

        Hi, we've built a demo of MPI on YARN with MPICH2.
        Anyone interested, please feel free to try it and fork it.
        https://github.com/clarkyzl/mpich2-yarn

        Ralph Castain added a comment -

        I am leaving Greenplum/Pivotal on May 17th and am no longer involved in Hamster.

        Ralph Castain added a comment -

        Sorry for the silence - the lawyers have not released the code yet, and I've been uncomfortable providing info in that absence. Now that Greenplum has reorganized into "Pivotal", that might change.

        I defer all further questions on this matter to Milind as I am leaving Greenplum/Pivotal on May 17th.

        Karthik Ramachandran added a comment -

        Has there been any progress on this? Is there any code (even rough code) that we could peek at?

        Vinod Kumar Vavilapalli added a comment -

        I saw a presentation from Ralph a while back but I thought he himself was going to update this JIRA some time soon. It doesn't seem like that is going to happen, so here it goes: What is MR+? - Open MPI -> http://www.open-mpi.org/video/mrplus/Greenplum_RalphCastain-2up.pdf

        Seeing interest on the lists and from others on this very JIRA, I do believe there is still merit in having some implementation of MPI on top of YARN. If others agree, I'll go ahead and file a ticket.

        Arun C Murthy added a comment -

        I have the same questions as Harsh.

        On May 17, 2012, Ralph said he was close to committing this to OpenMPI, as mentioned on this JIRA: http://s.apache.org/uY

        Where is this 'bile-spewing' and when did it start?

        I'm still looking forward to playing with this.

        Harsh J added a comment -

        Where exactly did all the name-calling you refer to happen? I haven't noticed it on the lists, and there have been a few asks there as well (IIRC), but no negativity ever came up in the responses. I do not see any 'bile-spewing' on this very ticket either. So what "community" are you pointing at?

        Thanks for still working on getting this available though, there are several people interested in this!

        Milind Bhandarkar added a comment -

        The "community" just created a huge issue for me in making this available to the community, by calling us "anti-community". So, while I am trying to get this released to the community, I now have a few more obstacles to overcome. Please bear with me, or better still, try to get the "community" to stop its bile-spewing against us, so that we can navigate through this mess.

        qiangliu added a comment -

        Where can the available version be downloaded? Thanks.

        Rahul V added a comment -

        So where is the code for the same? Any pointers?

        Zhuoluo (Clark) Yang added a comment -

        I am curious to see the pointer to the OMPI nightly snapshot link. Where can the code be obtained?
        Can you please share information on where it was committed in OpenMPI and how folks can try it out?

        Arun C Murthy added a comment -

        I was curious to see the code. Which branch has it been committed to?

        To clarify, there was no code committed to Hadoop itself.

        Ralph - Can you please share information on where it was committed to OpenMPI and how folks can try it out? Thanks.

        Varun Thacker added a comment -

        Hi Ralph,

        I was curious to see the code. Which branch has it been committed to?

        If there is more work needed on other parts related to Hamster i would love to contribute.

        Arun C Murthy added a comment -

        Great, thanks Ralph - really looking forward to it!

        Also, you/Milind might want to send a note out to Hadoop lists (general@) when the code is available so it's widely distributed.

        Ralph Castain added a comment -

        Hi Arun

        I should be committing it later tonight, or perhaps tomorrow. I'll send out a note once I do, including a pointer to the OMPI nightly snapshot link where the code can be obtained.

        Ralph

        Arun C Murthy added a comment -

        Ralph/Milind, this is great to hear!

        Do you have pointers to the code for me to play with? Thanks!

        Milind Bhandarkar added a comment -

        I am excited to report that, thanks to great efforts by Ralph Castain and Wangda Tan, Hamster (i.e., OpenMPI on YARN) now works flawlessly and is scheduled to be merged into the OpenMPI trunk soon. This effort was equivalent to building a second floor on a mobile home while it was hurtling down the freeway at 65 MPH. Thanks to both Ralph & Wangda.

        According to Ralph:

        "Lots of cleanup and documentation to do, and performance sucks per HPC
        standards. But at least it works!"

        To my knowledge, this is the first application framework implemented in C that uses the multi-lingual protobuf APIs for YARN. (For secure environments, a small Java-based shim is needed.)

        Also, it is encouraging that no changes were needed in Yarn to make resource allocation work for MPI. (MPI as a standard came along in 1994, 18 years before Yarn was designed.)

        Currently, using the MPI-IO functionality in MPI requires a shared POSIX file system mounted on every node. However, this will change in the future. For some distributed file systems (cough) that offer a POSIX interface, MPI-IO works today.

        Once it is decided whether BigTop can include non-ASF packages, we plan to work with the BigTop community to integrate OpenMPI (new-BSD licensed) into the big data stack.

        I am closing this issue as fixed.

        Dinkar Sitaram added a comment -

        Hi Milind,

        I discussed this with some of the Hadoop contributors (Devaraj Das, Arun Murthy, and Sharad Agarwal), and they suggested that the original approach might be a good project to take forward with a couple of students (I am teaching as a visiting faculty at PESIT, one of the best CS institutes here in Bangalore). Could you please post any code that you have written? I could use that as a base to start.

        Regards,
        Dinkar

        Ralph H Castain added a comment -

        Ah heck - fix the quotes. Sorry about that...

        Appreciate your thoughts. I think we generally agree on purpose, but maybe not so much on method. In my mind, bringing map-reduce to the MPI world in a first-class manner is a simpler, and ultimately more useful, solution as HPC clusters already exist and organizations know how to manage them. Certainly won't be true at places like Yahoo, but much of the business world has HPC systems (albeit of small size) in various depts.

        MPI jobs typically do run for long times, though surveys report that roughly 70% of them run for one hour or less (but more than a few minutes).

        Like I said, I'm sure it matters in the HPC world, but I'm pretty sure it's much less of an issue in the Hadoop world even as we work to improve container launch etc. As a result, I'm not very worried (at this stage) about this aspect for Hamster on YARN.

        I think this is largely an issue of scale. While Hadoop is deployed on large clusters, it seems to generally run in small jobs - i.e., a job consisting of 100 procs would be considered fairly large. In the HPC world, a 100 proc job is considered tiny. Those 1 hour jobs typically consist of hundreds to thousands of processes. Using the observed scaling behavior, Yarn would take the better part of an hour to launch and wireup an MPI job of that size.

        the requirement that we launch one container at a time (and copy binaries et al), will be difficult to overcome

        I'm not sure I follow. There is no requirement to launch one container at a time, how did that come up? An ApplicationMaster can, and should, launch multiple containers via threads etc. The MR AppMaster does that already.

        Yes/no. Unfortunately, MPI procs need to know who is collocated with them on a node at startup - otherwise, you make a major sacrifice in performance. So the AM cannot start launching until ALL nodes have been allocated. While it's true you can then make a threaded launcher to help reduce the time, it's still pretty much a linear launch pattern when you plot it out.
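        The "still pretty much a linear launch pattern" observation can be sketched as a quick model. The thread-pool size and per-container launch cost below are assumed values for illustration, not measurements.

        ```python
        # Sketch: launch makespan with a threaded container launcher.
        # Even with T launcher threads, the makespan for N containers is
        # ceil(N / T) * t_launch, which still grows linearly in N for any
        # fixed T. Thread count and per-container launch cost are assumed.

        import math

        def launch_makespan(n_containers, threads=10, t_launch_s=0.5):
            """Time to launch all containers with a fixed-size thread pool."""
            waves = math.ceil(n_containers / threads)
            return waves * t_launch_s

        # Once N is much larger than the thread count, doubling N doubles
        # the makespan: threading shifts the line down but keeps it linear.
        print(launch_makespan(1000))  # 50.0
        print(launch_makespan(2000))  # 100.0
        ```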

        the really big clusters do this with sophisticated network design (limiting the number of nodes served by each NFS server, multiple subnets) and TCP-over-IB protocols running at ~50Gbps, so systems connected via lesser networks can't compete

        That's fundamentally the reason Hadoop MapReduce Classic and YARN use HDFS to make the system more scalable/easy-to-manage with the other characteristics it brings along.

        Again, a question of perspective. The HPC world doesn't use the TCP networks for storing data files - only application programs. The data sits on parallel file systems that experiments have shown to be comparable or slightly faster than HDFS, and are easily managed. Again, HPC depts are familiar with these systems and know how to manage them.

        Keep in mind that HPC applications consume only kilobytes to a few megabytes of input data - but generate petabytes of output. So the problem is the inverse of what HDFS attempts to address. Where HDFS might come into play in HPC is therefore more in the visualization phase, where those petabytes are consumed to produce megabyte-sized images at video rates.

        The interest I've received has come more from the data analysis folks who have large amounts of data, but wish to utilize existing HPC clusters to analyze it with MR techniques. They may or may not use HDFS to do so - depends on the system administration.

        please feel free to consider adding node-to-node collective communications

        In the YARN design, the wireup is facilitated via the application's AppMaster. Thus each worker process just needs to share its info with its AppMaster for wireup. Again, the MR AM already does this for MR jobs.

        And therein lies the problem. Each worker process has to contact the AM, thus requiring a socket be opened and the required endpoint info sent to the AM, which must consume it. The AM must then send the aggregated info separately to each process. Result: quadratic scaling. Threading the AM helps a bit, but not a whole lot.

        To give you an example, consider again my benchmark system (3k-node, 12k-process, launch and wireup in 5sec). Using Yarn, the launch + wireup time computes to nearly an hour. We are very confident in those numbers because we have tested similar launch methods back in the days before we updated the OMPI architecture. In fact, we updated our architecture in response to those tests.
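        The quadratic-scaling argument in the two paragraphs above can be sketched with a toy traffic model. The endpoint size and AM throughput below are hypothetical numbers chosen only to show the shape of the curve.

        ```python
        # Sketch: AM-mediated wireup traffic grows quadratically in job size.
        # Each of N processes sends its endpoint info (s bytes) to the AM,
        # and the AM must then send the aggregated N*s-byte endpoint map
        # back to every one of the N processes. Endpoint size and AM link
        # throughput are assumed values, not measurements.

        def wireup_bytes(n_procs, endpoint_bytes=64):
            inbound = n_procs * endpoint_bytes              # N endpoints into the AM
            outbound = n_procs * n_procs * endpoint_bytes   # N*s map out to each proc
            return inbound + outbound

        def wireup_seconds(n_procs, endpoint_bytes=64, am_bytes_per_s=10e6):
            return wireup_bytes(n_procs, endpoint_bytes) / am_bytes_per_s

        # For the 12k-process example above, the outbound term dominates:
        # ~9.2 GB through one AM, which at an assumed 10 MB/s takes on the
        # order of fifteen minutes for wireup traffic alone.
        print(wireup_seconds(12_000))
        ```

        Whatever the exact constants, the N-squared outbound term is what makes a single aggregation point scale poorly, which is the crux of the objection.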

        So for small jobs, Yarn is fine. My concern is to provide an environment that can both run the small Hadoop MR jobs and support the larger HPC jobs, all on the same cluster. Creating a medium-performance version of MPI for Hadoop under Yarn is a fairly low priority, to be honest.

        Ralph

        Ralph H Castain added a comment -

        Hi Steve

        If you look at the UK university grid http://pprc.qmul.ac.uk/~lloyd/gridpp/ukgrid.html you can see that although there are lots of clusters, they are of limited storage capacity - that storage also forces you to choose where to run the work or rely on job preheating to pull it in from RAL or elsewhere (latency to do this is lower than pulling off tape). You can also see that there are a lot of jobs in the queues, including short-lived health tests that verify work reaches the expected answers. I don't know about the duration/needs of the actual work jobs.

        When you consider job startup delays you have to look at time to fetch data over long-haul connections, maybe compile code for target cluster, and recognise that without a SAN you can't expect uniform access times to all data.

        A grid is very different from an HPC cluster, which are far more common (grids have been dying out over the last few years). We never see data pulled over long-haul connections - frankly, you don't see people doing it any more on grids either due to the unreliability and delays in delivery. HPC clusters are almost always homogeneous (I think I've seen two heterogeneous HPC clusters outside of a lab so far), and generally are backed by a parallel file system that actually does provide pretty uniform access times. Remember: MPI jobs use MPI-IO to fetch/write data, and they write a lot more data than they read (as per my prior note).

        Thus, once an allocation is given, there is no startup delay like you describe. There is some time required to load binaries and libs onto each node, but that scales well and goes very fast. As per my other note, we figured out how to solve that a while back.

        What you would get from MPI over hadoop is the ability to run MPI work on the cluster -a cluster which, if it also had infiniband on, would have low-latency interconnections. (yes, there is a cost for that, but you may want it for a shared cluster).

        Agreed - so long as the MPI job is small enough, it should work.

        What about an MPI mechanism with a Grid Scheduler that block-rents a set of machines that can then be used for multiple jobs off the MPI queue, and which aren't released after each job? Once the capacity on the hosts is allocated, health checks can verify the machines work properly, then it can await work. The scheduler can look at the pending queue and flex its set of machines based on expected load?

        Job startup would be reduced to the time to push out work to the pre-allocated hosts, which doesn't need to rely on heartbeats and could use Zookeeper or other co-ordination services.

        This wouldn't be a drop-in replacement for one of the big supercomputing clusters, but it would let people run MPI jobs within a Hadoop cluster.

        I'm not sure how that would work - I guess you would have to interface something like OGE/SGE to Yarn so that it could "rent" machines from Yarn? As Milind noted, that interface is non-trivial today. I've talked to the GE folks about it (as well as to the other major HPC RM orgs), but they don't have much interest in providing such a capability - they are far more interested in the reverse approach (i.e., running MR on an HPC cluster).

        Situation could change as time passes and the interface stabilizes/becomes easier.

        HTH
        Ralph

        Ralph H Castain added a comment -

        Appreciate your thoughts. I think we generally agree on purpose, but maybe not so much on method. In my mind, bringing map-reduce to the MPI world in a first-class manner is a simpler, and ultimately more useful, solution as HPC clusters already exist and organizations know how to manage them. Certainly won't be true at places like Yahoo, but much of the business world has HPC systems (albeit of small size) in various depts.

        MPI jobs typically do run for long times, though surveys report that roughly 70% of them run for one hour or less (but more than a few minutes).

        Like I said, I'm sure it matters in the HPC world, but I'm pretty sure it's much less of an issue in the Hadoop world even as we work to improve container launch etc. As a result, I'm not very worried (at this stage) about this aspect for Hamster on YARN.

        I think this is largely an issue of scale. While Hadoop is deployed on large clusters, it seems to generally run in small jobs - i.e., a job consisting of 100 procs would be considered fairly large. In the HPC world, a 100 proc job is considered tiny. Those 1 hour jobs typically consist of hundreds to thousands of processes. Using the observed scaling behavior, Yarn would take the better part of an hour to launch and wireup an MPI job of that size.

        the requirement that we launch one container at a time (and copy binaries et al), will be difficult to overcome

        I'm not sure I follow. There is no requirement to launch one container at a time, how did that come up? An ApplicationMaster can, and should, launch multiple containers via threads etc. The MR AppMaster does that already.

        Yes/no. Unfortunately, MPI procs need to know who is collocated with them on a node at startup - otherwise, you make a major sacrifice in performance. So the AM cannot start launching until ALL nodes have been allocated. While it's true you can then make a threaded launcher to help reduce the time, it's still pretty much a linear launch pattern when you plot it out.
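        Ralph's "pretty much linear" point can be put in rough numbers. The sketch below is an illustrative cost model only - the thread-pool width and per-container launch time are invented assumptions, not YARN measurements - and it simply shows that a threaded launcher divides a linear cost by a constant factor rather than changing its shape:

```python
import math

# Illustrative model (not YARN code): total time to launch N containers
# when an AppMaster drives launches through a pool of W worker threads,
# each launch taking t seconds (RPC + binary copy). All figures here are
# assumptions chosen purely for illustration.
def launch_time(n_containers, n_threads, secs_per_launch):
    # Each thread issues its launches sequentially, so the pool needs
    # ceil(N / W) rounds of t seconds each.
    return math.ceil(n_containers / n_threads) * secs_per_launch

# Scaling the job 12x scales the launch time 12x: still linear in N,
# no matter how wide the (fixed-size) thread pool is.
small = launch_time(1_000, 50, 0.5)   # 10.0 s under these assumptions
large = launch_time(12_000, 50, 0.5)  # 120.0 s under these assumptions
print(small, large)
```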

        the really big clusters do this with sophisticated network design (limiting the number of nodes served by each NFS server, multiple subnets) and TCP-over-IB protocols running at ~50Gbps, so systems connected via lesser networks can't compete

        That's fundamentally the reason Hadoop MapReduce Classic and YARN use HDFS to make the system more scalable/easy-to-manage with the other characteristics it brings along.

        Again, a question of perspective. The HPC world doesn't use the TCP networks for storing data files - only application programs. The data sits on parallel file systems that experiments have shown to be comparable or slightly faster than HDFS, and are easily managed. Again, HPC depts are familiar with these systems and know how to manage them.

        Keep in mind that HPC applications consume only kilobytes to a few megabytes of input data - but generate petabytes of output. So the problem is the inverse of what HDFS attempts to address. Where HDFS might come into play in HPC is therefore more in the visualization phase, where those petabytes are consumed to produce megabyte-sized images at video rates.

        The interest I've received has come more from the data analysis folks who have large amounts of data, but wish to utilize existing HPC clusters to analyze it with MR techniques. They may or may not use HDFS to do so - depends on the system administration.

        please feel free to consider adding node-to-node collective communications

        In the YARN design, the wireup is facilitated via the application's AppMaster. Thus each worker process just needs to share its info with its AppMaster for wireup. Again, the MR AM already does this for MR jobs.

        And therein lies the problem. Each worker process has to contact the AM, thus requiring a socket be opened and the required endpoint info sent to the AM, which must consume it. The AM must then send the aggregated info separately to each process. Result: quadratic scaling. Threading the AM helps a bit, but not a whole lot.

        To give you an example, consider again my benchmark system (3k-node, 12k-process, launch and wireup in 5sec). Using Yarn, the launch + wireup time computes to nearly an hour. We are very confident in those numbers because we have tested similar launch methods back in the days before we updated the OMPI architecture. In fact, we updated our architecture in response to those tests.
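        The quadratic claim is easy to check with back-of-the-envelope arithmetic. In the sketch below, the 64-byte endpoint record is an assumed size chosen only for illustration; the point is that the AM's outbound traffic grows with the square of the job size:

```python
# Rough model (assumed numbers, not measurements) of the star-shaped
# wireup described above: N processes each send an endpoint record of
# `rec_bytes` to the AM, which then returns the full N-record table to
# every one of the N processes.
def am_traffic_bytes(n_procs, rec_bytes):
    inbound = n_procs * rec_bytes             # N records in
    outbound = n_procs * n_procs * rec_bytes  # N copies of the N-record table out
    return inbound + outbound

# 12k processes with (say) 64-byte records: the AM must push roughly
# 9 GB through a single endpoint - and doubling the job roughly
# quadruples the traffic.
total = am_traffic_bytes(12_000, 64)
print(total / 1e9)
```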

        So for small jobs, Yarn is fine. My concern is to provide an environment that can both run the small Hadoop MR jobs and support the larger HPC jobs, all on the same cluster. Creating a medium-performance version of MPI for Hadoop under Yarn is a fairly low priority, to be honest.

        Ralph

        Hide
        Steve Loughran added a comment -

        Classic HPC systems contain some expectations that aren't present in normal Hadoop clusters

        1. low-latency IPC & APIs to access it for messaging
        2. shared location-independent filesystem (e.g. SAN, GPFS, ...)
        3. highly calibrated C++ compiler and library systems to guarantee consistent output of FPU- and GPU- intensive code.
        4. C/C++ code with lots of MPI calls scattered throughout for synchronization.
        5. either checkpointing of long-lived work or timespan limits for jobs.

        There are some other features that aren't necessarily design goals but have arisen

        1. lots of smaller per-facility sites with different architectures rather than one or two large-scale datacentres.
        2. limited local storage (due to cost of storage solution)
        3. a reliance on zero-costed internet-2/SuperJanet/... interconnect to pull data from shared repositories (for CERN, the national tier-1 sites such as Rutherford Appleton Laboratories)
        4. high operations cost relative to storage and compute capacity (due to small, heterogeneous, multi-site clusters, selected storage solutions)

        If you look at the UK university grid http://pprc.qmul.ac.uk/~lloyd/gridpp/ukgrid.html you can see that although there are lots of clusters, they are of limited storage capacity - that storage also forces you to choose where to run the work or rely on job preheating to pull it in from RAL or elsewhere. (latency to do this is lower than pulling off tape). You can also see that there are lots of jobs in the queues, including short-lived health tests that verify work reaches the expected answers. I don't know about the duration/needs of the actual work jobs.

        When you consider job startup delays you have to look at time to fetch data over long-haul connections, maybe compile code for target cluster, and recognise that without a SAN you can't expect uniform access times to all data.

        What you would get from MPI over hadoop is the ability to run MPI work on the cluster -a cluster which, if it also had infiniband on, would have low-latency interconnections. (yes, there is a cost for that, but you may want it for a shared cluster).

        What about an MPI mechanism that has a Grid Scheduler that block-rents a set of machines that can then be used for multiple jobs off the MPI queue, and which aren't released after each job? Once the capacity on the hosts is allocated, health checks can verify the machines work properly, then it can await work. The scheduler can look at the pending queue and flex its set of machines based on expected load?

        Job startup would be reduced to the time to push out work to the pre-allocated hosts, which doesn't need to rely on heartbeats and could use Zookeeper or other co-ordination services.

        This wouldn't be a drop-in replacement for one of the big supercomputing clusters, but it would let people run MPI jobs within a Hadoop cluster.

        Hide
        Arun C Murthy added a comment -

        Ralph, thanks again for the discussion.

        If I wasn't clear before, I'd like to re-emphasize that I look at Hamster as a way to bring MPI apps to the Hadoop world in a first-class manner (as opposed to the hacks employed so far), not necessarily as a way to bring YARN to the MPI world to which I'm agnostic (too).

        I fully respect that the HPC world has its own conventions and expectations, which aren't really what YARN is geared to solve.

        As a result, with some engineering effort I'm confident Hamster could be a really useful addition to the Hadoop toolkit even with some of the current issues you pointed out.


        MPI jobs typically do run for long times, though surveys report that roughly 70% of them run for one hour or less (but more than a few minutes).

        Like I said, I'm sure it matters in the HPC world, but I'm pretty sure it's much less of an issue in the Hadoop world even as we work to improve container launch etc. As a result, I'm not very worried (at this stage) about this aspect for Hamster on YARN.

        Given your above stat, I'm even more confident Hamster should be fine presently.

        the requirement that we launch one container at a time (and copy binaries et al), will be difficult to overcome

        I'm not sure I follow. There is no requirement to launch one container at a time, how did that come up? An ApplicationMaster can, and should, launch multiple containers via threads etc. The MR AppMaster does that already.


        the really big clusters do this with sophisticated network design (limiting the number of nodes served by each NFS server, multiple subnets) and TCP-over-IB protocols running at ~50Gbps, so systems connected via lesser networks can't compete

        That's fundamentally the reason Hadoop MapReduce Classic and YARN use HDFS to make the system more scalable/easy-to-manage with the other characteristics it brings along.


        please feel free to consider adding node-to-node collective communications

        In the YARN design, the wireup is facilitated via the application's AppMaster. Thus each worker process just needs to share its info with its AppMaster for wireup. Again, the MR AM already does this for MR jobs.


        Overall, as disappointed as I am, I can understand that your priorities lie elsewhere.

        OTOH, I'd really appreciate if you could share your still-born code or even a prototype. I'm pretty sure there will be volunteers to take it forward since I know of sufficient interest in running MPI on YARN. Thanks!

        Hide
        Ralph H Castain added a comment -

        Hi Arun

        No offense taken at all - hopefully both directions. I'm pretty agnostic on these things after working with them all over the years. Every one has its pros/cons.

        I'll try to address your points in sequence. I'd be happy to drop by some time (I'm in SF frequently these days) and chat about it (higher bandwidth is often helpful).

        ------------------------------------------
        First, people on HPC clusters also sit for some time in the queue waiting for resources. It is a very odd day when you can just jump onto a system! However, launch time is still a critical issue on such clusters since they typically operate at scale. Seconds add up over time.

        MPI jobs typically do run for long times, though surveys report that roughly 70% of them run for one hour or less (but more than a few minutes). Although we might agree that a few seconds shouldn't matter, the fact is that users will loudly protest such delays, especially when many of the MPI jobs are actually run interactively during development. Startup time is a very sensitive issue in HPC - my back has the scars to prove it.

        Without changing the basic design of Yarn, I don't see how Yarn can ever really compete in this area. The fact that the RM has to wait to be called via heartbeat by the nodes in order to do the allocation is a bottleneck, and the requirement that we launch one container at a time (and copy binaries et al), will be difficult to overcome.

        --------------------------------------

        HPC systems are always supported by NFS mounts of home directories as well as system libraries - we never use local disks due to the management headache they create. User requirements for different library versions are handled via the "modules" system. We have found ways to make this scalable to over 100k nodes without significant startup time. Of course, the really big clusters do this with sophisticated network design (limiting the number of nodes served by each NFS server, multiple subnets) and TCP-over-IB protocols running at ~50Gbps, so systems connected via lesser networks can't compete. On the other hand, smaller clusters do quite well with appropriately laid out Ethernet support.

        Thus, we never pre-distribute binaries, even on the largest clusters (jars aren't an issue as you have to look hard to find any Java applications in HPC). We also never build static libraries as memory is at a premium for HPC jobs, so everything is dynamically linked. On the larger systems, there sometimes are bandwidth restrictions on the NFS channels - in these cases, we disable dlopen so that the MPI libraries are rolled up into one dynamic lib to minimize the number of NFS calls. The launch example I gave was from such a configuration.

        --------------------------------------

        Clearly, Yarn could provide collective communication support if it chose to do so. However, it would result in a fundamental change in the Yarn architecture.

        In the HPC world, the RMs operate in one of two modes. Some, like SLURM, provide their own wireup support. In these cases, each process provides its endpoint information to the local RM daemon. Those daemons in turn share that info across all other daemons hosting processes from that job using collective operations designed for large scale. Application processes are then either passed the resulting info (assembled from all procs in the job), or can query the local RM daemon to obtain the pieces they need.

        Some RMs, like Torque, do not offer these services. In those cases, the MPI implementation itself provides it by first launching its own daemons on each node in the job. These daemons then assume the role played by the above SLURM daemon, circulating the endpoint info using similar collective operations. While it might seem that this has to take longer due to the additional daemon launch, it actually is competitive as the startup time (given the scalable launch) for the daemons is really quite small, and the time required to share info across the procs (called the "modex" operation) is much longer.

        Translating these methods to Yarn would require that the Nodemanagers learn how to communicate with each other, which seems to me to break the Yarn design. I admit I could be wrong here, so please feel free to consider adding node-to-node collective communications. It isn't terribly hard to do, but it does take some work to make it robust.
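        The scalable collective exchange described above can be sketched abstractly. The toy simulation below is not OMPI or SLURM code; it only illustrates why daemon-to-daemon exchange (here, a recursive-doubling allgather over a power-of-two number of daemons) completes the endpoint exchange in logarithmically many rounds, rather than the N rounds a central hub would need:

```python
import math

def allgather_rounds(n_daemons):
    # Rounds needed by a recursive-doubling allgather.
    return math.ceil(math.log2(n_daemons))

def simulate_allgather(n):
    # Toy simulation for a power-of-two number of daemons. Each daemon
    # starts knowing only its own endpoint; in round r, daemon i
    # exchanges everything it knows with daemon (i XOR 2^r).
    known = [{i} for i in range(n)]
    rounds = 0
    while any(len(k) < n for k in known):
        step = 1 << rounds
        known = [known[i] | known[i ^ step] for i in range(n)]
        rounds += 1
    return rounds

# E.g. ~3k daemons need only 12 rounds; 8 daemons finish in 3.
print(allgather_rounds(3072), simulate_allgather(8))
```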

        ---------------------------------------

        I have looked at HoD and can understand the problems. The mistake (IMHO) was to attempt to run it directly under Torque. This has its limitations. We avoid those by actually using an adaptation of OMPI's mpirun tool, which eliminates most of the problems. So far, it seems to be working quite well, though there is more to be done and always unforeseen problems to resolve.

        ---------------------------------------

        I suspect there are folks out there who will use Yarn and have interest in using MPI. However, my assignment is to focus on creating the ability to dual-use clusters as this has become a limiting issue, so that's where my emphasis currently lies. Once that is complete, and assuming the powers-that-be agree, I will take another look at it. Perhaps by then the pain will have lessened - frankly, I find Yarn to be very difficult to work with at the moment.

        HTH
        Ralph

        Milind Bhandarkar added a comment -

        One major issue that I faced early on in my Hamster prototype is that Hadoop RPC is still not completely Java-independent, even with protobufs, especially when authentication comes into play. Thus, native C clients cannot be used with the YARN RM. This is one of the major roadblocks that Ralph faced as well, and he had to go through a lot of trouble to get around it. Arun, what is the roadmap for that?

        Milind Bhandarkar added a comment -

        I mostly agree with Arun (except for HoD part). Having a low-latency communication mechanism in Hadoop world is valuable. Applications on this platform will not be typical HPC applications, but mostly iterative machine-learning applications that require crunching more input data, and produce models (small output data). Many are using sub-optimal ways of launching iterative computations on Hadoop anyway, and having even an inefficient MPI launch will be more efficient than what people resort to today.

        Arun, HoD is now viable. More later

        Arun C Murthy added a comment -

        Ralph, thanks for the heads up. It's disappointing it's some way away...

        Regarding your observations (no offense taken), some clarifications:

        Foremost, there are a couple of reasons why the container launch on YARN should be more than sufficient:

        1. People using MPI on Hadoop are generally running very heavy computation models which run for several minutes at least if not for several hours. In fact, I've seen lots of MPI-like jobs run on Hadoop MapReduce which run for several days! In such cases a few seconds is really pure noise. (Also, we've added several features to CapacityScheduler to actually support such apps in hadoop-1.x.)
        2. Hadoop clusters are also, typically, heavily over-subscribed which means tasks (containers) spend a lot of time in queue waiting for resources to be available.

        Having said that, there is no doubt we are focused on improving YARN. It's currently significantly faster than MapReduce Classic (JobTracker/TaskTracker) in hadoop-1.x for container (task) launch etc., and we have more improvements up our sleeve for the future.


        A very important factor for container launch in YARN (as with current MR) is that the system itself takes responsibility for 'setting up' the container by copying its jars/shared-objects/binaries from HDFS. An easy way to speed up container launch is to pre-distribute MPI binaries on all nodes to do a perf-comparison. Is this something you've already tried?

        In contrast, you probably already did that on Torque/Moab/Maui, since the binaries have to be distributed before the MPI 'job' is launched?


        Regarding 'collective communication support' can you please add more colour?

        With my naive understanding of the MPI world, I don't see why the same mechanisms can't be used in YARN.

        YARN doesn't mandate application communication mechanisms and just leaves it up to the applications. So, I fail to see why this comes up for MPI applications - happy to be educated!


        Regarding your current area of focus i.e. running Hadoop MapReduce (and HDFS?) on Torque etc., this is something we in the Hadoop community have already tried (and abandoned) with "Hadoop On Demand" i.e. HoD system (HoD jira: HADOOP-1301. HoD removed via HADOOP-7137). There are several reasons why we went away, and too much water under the bridge.


        Overall, I'm sure there are several people in the Hadoop community who would love to use MPI on Hadoop/YARN; they already do that by bending Hadoop MapReduce in myriad & scary ways! I'm sure Milind, like me, has seen several examples!

        I can see why some of the issues you've brought up (millisecond latency etc.) matter to folks in the MPI world; in some ways it's a clash of the Hadoop and HPC worlds, and the fallout is reasonable & expected!

        However, I still believe Hamster, as envisaged here, has a lot of value to the Hadoop community, if you choose to focus on it. Thanks!

        Milind Bhandarkar added a comment -

        Thanks for the summary & status, Ralph. It is true that not a lot of Hadoop users so far are concerned with the job startup time, given that the job typically runs for several minutes. (In my experience, the hadoop clusters are heavily oversubscribed, and job might even be pending in the queue for several minutes before even the first task is started.)

        So, I think a slow wireup should be acceptable to the hadoop community, and there is a value in having hamster use yarn. (Since your second approach of plugging in a HPC RM can remain modularly separate, one can have both implementations, increasing the choice for users, which is good, IMHO.)

        Ralph H Castain added a comment -

        I'm afraid our optimism about a near-term patch proved a little too hopeful. Milind's initial prototype (based on advice from me, before I really understood the situation) won't work on multi-tenant systems, so another method had to be developed. After spending a couple of months beating on this, I've pretty much put it on hold for now as I pursue an alternative approach.

        Adding MPI support to Yarn has proven to be very difficult, and may not be worth the pain. The problems stem from some basic Yarn architectural decisions that run counter to HPC standards. For example, the linear launch pattern creates scaling behaviors that are objectionable to HPC users, who generally consider anything less than logarithmic to be unacceptable. All HPC RMs meet that requirement, with many of them scaling at better than logarithmic levels. The result is striking: on a 64-node launch, Yarn will take several seconds to start the job - whereas an HPC RM will start the same job in milliseconds.

        Similarly, the lack of collective communication support in Yarn means that MPI wireup scales quadratically under Yarn with the number of processes. Contrast this with a typical HPC installation where wireup scales logarithmically, and remember that wireup is the largest time consumer during MPI startup, and you can understand the concern. As a benchmark, we routinely start a 3k-node, 12k-process job (including MPI wireup) in about 5 seconds using Moab/Torque.
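
        As a rough illustration of the scaling argument above (the numbers are illustrative, not measurements from Yarn or any HPC RM), a linear launch needs O(n) sequential rounds and a full-mesh wireup needs O(n^2) messages, while tree-based schemes need O(log n) rounds and O(n) messages:

```python
import math

def linear_launch_rounds(n):
    # One container started per sequential step: O(n) rounds.
    return n

def tree_launch_rounds(n):
    # Binomial-tree launch: the set of running daemons doubles each
    # round, so n nodes are up after ceil(log2(n)) rounds.
    return math.ceil(math.log2(n)) if n > 1 else 0

def full_mesh_wireup_messages(n):
    # Every process exchanges endpoint info with every other: O(n^2).
    return n * (n - 1)

def tree_allgather_messages(n):
    # Gather endpoints up a tree, broadcast the result back down:
    # 2*(n-1) messages over O(log n) rounds (a "modex"-style collective).
    return 2 * (n - 1)

for n in (64, 3072):
    print(f"n={n}: launch {linear_launch_rounds(n)} vs "
          f"{tree_launch_rounds(n)} rounds; wireup "
          f"{full_mesh_wireup_messages(n)} vs {tree_allgather_messages(n)} msgs")
```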

        Finally, when we compared fault tolerance performance, we didn't see a significant difference between Yarn and the latest HPC RM releases. Both exhibited similar recovery behaviors, and had similar multi-failure race condition issues.

        Just to be clear, I'm not criticizing Yarn - the other RMs had similar behaviors at a corresponding point in their development (including my own past efforts in that arena!). Remember, today's HPC RMs each can boast of roughly 50-100 man-years of development behind them, and have undergone several cycles of architectural change to improve scalability, so one would naturally expect them to out-perform Yarn at this point.

        There are other issues, including the difficulty of getting a Yarn AM to actually work. However, the impetus behind the "hold" really was the above observations, combined with an overwhelmingly negative reaction from the HPC community when I asked about using Yarn on general purpose (Hadoop + non-Hadoop apps) clusters. In contrast, I received a correspondingly positive reaction to the idea of running Hadoop MR on a general purpose cluster, separating that code (plus HDFS) from Yarn.

        I have therefore started pursuing this option, which proved to be much easier to do. I expect to have an "early adopter" version of MR/HDFS on an HPC cluster sometime in the next week or two, with a general release this summer (aided by other members of the HPC community who have volunteered their help).

        Of course, I realize that there will be people out there that decide to run Yarn on their Hadoop systems (as opposed to worrying about general purpose clusters), and that they might also be interested in using MPI. So I'll return to this after I get the HPC problem solved, with the caveat that such users understand that the scaling and performance will not be what they are used to seeing on non-Yarn systems.

        HTH
        Ralph

        Arun C Murthy added a comment -

        Milind/Ralph - Any update on Hamster? Tx.

        Milind Bhandarkar added a comment -

        Ralph has taken this project over from me, and is very close to a patch.

        Jon Bringhurst added a comment -

        Although it seems like this project is very interesting, I'd like to draw some attention to Apache Mesos.

        http://www.mesosproject.org/

        It allows one to run several different versions of Hadoop, Torque, and MPI on the same cluster at the same time. For example, a map-reduce job can be running on the same cluster at the same time an MPI job is.

        Forest Tan added a comment -

        This is a very helpful feature for scientific algorithms; I hope it will be released soon.

        Ralph Castain added a comment -

        Ah - my bad. I didn't realize I was looking at the comments in reverse order.

        After reading the comments in the correct order, I now better understand the thread and see that Milind is following what I had suggested. As to the discussion of secure communications, this is a continuing issue in the MPI community. The problem is that securing at the message level creates considerable overhead and severely impacts MPI performance.

        What the community has chosen to do is secure at the user level, and then check socket connections to ensure we are talking to someone from within our own application. Thus, we launch based on ssh-like authentication requirements. During MPI_Init, we wire up socket connections. As each connection is made, we exchange an initial "ident" message that checks to ensure that the process on the other end is a member of our application. If it isn't, we drop the connection.
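
        A toy sketch of that ident handshake (not OMPI code - a plain job-id compare matches what Ralph describes, and the HMAC here stands in for the optional extra security during socket formation; per-job key distribution is assumed to happen out of band):

```python
import hmac, hashlib, os

# Hypothetical job identifier; a real runtime derives this from the
# launch credentials rather than a constant.
JOB_ID = b"hamster-job-42"

def make_ident(job_id: bytes, key: bytes) -> bytes:
    # First message sent on every newly formed socket connection.
    return hmac.new(key, job_id, hashlib.sha256).digest()

def accept_peer(received: bytes, job_id: bytes, key: bytes) -> bool:
    # Keep the connection only if the peer proves it belongs to our
    # application; otherwise drop it.
    return hmac.compare_digest(received, make_ident(job_id, key))

key = os.urandom(16)
print(accept_peer(make_ident(JOB_ID, key), JOB_ID, key))       # same job
print(accept_peer(make_ident(b"other-job", key), JOB_ID, key))  # outsider
```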

        If you want to add further security during the socket formation phase, nobody will object - though we might put it on a configuration basis so others aren't impacted as it will slow down launch times on very large clusters.

        HTH
        Ralph

        Ralph Castain added a comment -

        Let me preface my comment by confessing my current ignorance of Hadoop. I'm working on rectifying that situation, but won't claim to be anywhere close to fully understanding it.

        That said, I'm wondering if it is possible to simply run the MPI processes as standard Hadoop processes? I confess this was my initial thought. Rather than creating a cluster and using mpirun, just have the user start a standard Hadoop job - but with the processes being part of an overall MPI application. Thus, the processes would all call MPI_Init, execute as an MPI application, call MPI_Finalize, and then exit. If a user wants to integrate that application with MapReduce, more power to them - I can see some cases where that would be of interest.

        My point here is that you don't need mpirun at all, nor do you need all the overhead of running OMPI daemons. The Hadoop daemons can start and monitor the state of health of the MPI processes just fine. We might add some capability to the Hadoop daemons to assist (e.g., binding), but those would be of use regardless of whether or not the process is part of an MPI application.

        As I said, please forgive the ignorance if my suggestion makes no sense.

        Milind Bhandarkar added a comment -

        Sorry folks. I got distracted this week by some mind-numbing non-technical stuff, and progress on hamster was slow as a result. I will be travelling next week, but am hoping to find some time to work on it.

        Luke Lu added a comment -

        the problem with mpiexec is its license:

        "Mpiexec is free software and is licensed for use under the GNU General Public License, version 2."

        OpenMPI, on the other hand, is BSD-licensed, and implements the MPI-2 standard.

        You're confusing mpiexec the software from OSC with mpiexec the standard specified in MPI standard (MPI-2.2 section 8). OpenMPI includes an mpiexec executable, as many other implementations do.

        what they do in the user-code is equivalent to what users can do in map-reduce code, so I do not see an issue here.

        They're not equivalent, as normal MapReduce tasks' communication cannot be attacked by other users. The Open MPI implementation is somewhat equivalent to a user implementing extra insecure protocols in map/reduce tasks in addition to the standard secure Hadoop MapReduce protocols.

        Milind Bhandarkar added a comment -

        @Luke, the problem with mpiexec is its license:

        "Mpiexec is free software and is licensed for use under the GNU General Public License, version 2."

        OpenMPI, on the other hand, is BSD-licensed, and implements the MPI-2 standard.

        Re: security, in this implementation, the MPI processes are launched with authentication, they run as the submitting user, and what they do in the user-code is equivalent to what users can do in map-reduce code, so I do not see an issue here.

        Luke Lu added a comment -

        how do you prevent map tasks opening sockets, receiving connections, and communicating with each other in Hadoop? Isn't that the same case here?

        Map tasks that do the above are not secure; that's why we have Hadoop security, which authenticates all Hadoop protocols, so that normal map/reduce tasks are secure. By making this jira work with the MPI standard and making the non-standard portion configurable, you allow users to pick a secure MPI implementation, such as ES-MPICH2.

        Milind Bhandarkar added a comment -

        @Luke, how do you prevent map tasks opening sockets, receiving connections, and communicating with each other in Hadoop? Isn't that the same case here?

        Luke Lu added a comment -

        mpiexec is used by MPI-2, and OpenMPI supports that.

        My point was that MPI-2 is the current standard and that if we stick to the standard, the setup can be used with other MPI implementations, such as the popular MPICH2.

        Milind Bhandarkar added a comment -

        @Arun, I do not see how a ContainerLaunchContext can get the hostname and port of the 0'th container (which is the head node). (I remember Jerry had worked around this problem by making the JobClient the 0th process. But having a gateway execute heavy-duty code is not good.)

        Luke Lu added a comment -

        I think un-encrypted communication among MPI processes should be acceptable.

        Hadoop RPC is authenticated but not encrypted under the assumption that there is no untrusted root process in the cluster, which is reasonable. Open-MPI doesn't (securely) authenticate connections and messages, which is not acceptable.

        Can you elaborate on your third point?

        This allows users to pass non-standard options (other than -n) to mpirun/mpiexec.

        Arun C Murthy added a comment -

        Milind, you can set the environment of a.out directly when launching it (@see ContainerLaunchContext) - so you might not need the extra process to launch a.out?

        Milind Bhandarkar added a comment -

        @Luke

        1. Communication among processes in Hadoop, i.e. map output that gets consumed by reduce input, is not encrypted. I think un-encrypted communication among MPI processes should be acceptable.
        2. mpiexec is used by MPI-2, and OpenMPI supports that.

        Can you elaborate on your third point?

        Milind Bhandarkar added a comment -

        @Arun, the direct-launch method requires that certain environment variables be set that a.out can access. At a minimum, the number of "nodes", and the host and port of the head node (i.e. the process with rank 0), need to be available to all processes. Thus, we will have a tiny process that sets these environment variables and launches a.out. When a.out calls MPI_Init(), the MPI library code will read these env vars, wait till all the processes have reported to the head node, and start execution.
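
        The wrapper-to-MPI_Init handoff described above can be sketched as follows; the HAMSTER_* variable names are invented here for illustration, not the actual names any MPI library reads:

```python
import os

def bootstrap_env(rank, nprocs, head_host, head_port):
    # What the tiny wrapper process would set before exec'ing a.out.
    return dict(os.environ,
                HAMSTER_RANK=str(rank),
                HAMSTER_NPROCS=str(nprocs),
                HAMSTER_HEAD=f"{head_host}:{head_port}")

def read_bootstrap(env):
    # What MPI_Init would read back: this process's rank, the world
    # size, and the endpoint of the rank-0 head node to report to.
    host, port = env["HAMSTER_HEAD"].rsplit(":", 1)
    return (int(env["HAMSTER_RANK"]), int(env["HAMSTER_NPROCS"]),
            host, int(port))

env = bootstrap_env(3, 32, "node07", 45000)
print(read_bootstrap(env))  # (3, 32, 'node07', 45000)
```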

        Luke Lu added a comment -

        Couple of comments based on what I've seen so far:

        1. We should not pick a specific MPI implementation (Open-MPI), which is not secure: unlike the current Hadoop RPC, Open-MPI's wire protocols are not "secure" (with caveats). The only secure MPI implementation that I'm aware of is a research project called ES-MPICH2, which is of course based on MPICH2.
        2. Use mpiexec instead of mpirun by default, as the former is specified by the MPI standard.
        3. Make environment management commands (mpiexec etc.) configurable.

        I'm on vacation, so expect high latencies in response

        Arun C Murthy added a comment -

        "hamster" is a client application that connects to the RM, asks it to create an AM with 32 containers, and after all of those are launched, executed a.out inside them, after setting some environment variables for connecting to the AM to get the "node list".

        Milind, I'm fairly ignorant about OpenMPI - does this mean you don't need a 'unix process' other than the one running a.out itself? Or, do you still need to run some other daemon to then fork a.out?

        Milind Bhandarkar added a comment -

        Progress report (I wish more people did this on a daily basis for the JIRAs they are working on):

        Had a great conf call with Jeff Squyres and Ralph Castain (both at Cisco, OpenMPI stewards). Both were excited that OpenMPI is getting Hadoop co-existence. They suggested that I base the Hamster implementation on the "direct-launch" model, such as with slurmd. (Having used slurm in the past, I understand why.)

        So, the design described above has changed. Now, the way to launch MPI jobs on a hadoop cluster is simply:

        hamster -np 32 a.out
        

        "hamster" is a client application that connects to the RM, asks it to create an AM with 32 containers, and after all of those are launched, executed a.out inside them, after setting some environment variables for connecting to the AM to get the "node list".

        That way, there are no security holes (as described above), since the MPI cluster exists only for the duration of the job.
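The allocate-then-launch ordering described above can be modeled as a toy loop: request N containers, wait until all N are allocated, and only then launch a.out everywhere (since MPI_Init() blocks until every rank reports in). The ToyAM class below is a stand-in invented for illustration, not the YARN API; only the ordering mirrors the design.

```python
class ToyAM:
    """Stand-in application master: collects containers, then launches."""

    def __init__(self, wanted):
        self.wanted = wanted        # number of containers requested
        self.containers = []        # hosts allocated so far
        self.launched = []          # (rank, host, executable) tuples

    def on_container_allocated(self, host):
        self.containers.append(host)

    def ready(self):
        return len(self.containers) >= self.wanted

    def launch_all(self, executable):
        # Launch only once every requested container is available.
        assert self.ready()
        for rank, host in enumerate(self.containers):
            self.launched.append((rank, host, executable))
        return self.launched

# Demo: hamster -np 3 a.out, modeled in miniature.
am = ToyAM(wanted=3)
for h in ["node1", "node2", "node3"]:
    am.on_container_allocated(h)
procs = am.launch_all("a.out")
```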

        @Vinod, If you remember, I had sent an email on a Y-internal mailing list that HoD will make a comeback. This is it. HoD was loved very much, especially after you and Hemanth took over, in terms of stability. As I had said in that email, once the container abstraction is solidified, HoD will make a comeback. So, here it is. (After all, creating a slice of a shared resource has worked for the last 40 years, why won't it work now?)

        Vinod Kumar Vavilapalli added a comment -

        This is a great addition! +1 for the simple design.

        I guess the client spins till all the containers are allocated (Oh, it's HOD all over again.)

        Like MR, the client could continuously ping the AM via RPC to know about the status of container-allocations. Presence/absence of HDFS file sure is a simple beginning.

        Also, if any of the containers (running orted) exits abnormally, the entire virtual MPI cluster is terminated. (This limitation will be removed in the next version.)

        How does the mpi client get to know about this? It'd be great if mpi-run automatically detects this.

        Security can be postponed for the first-cut.

        For now, +1 to a hadoop-openmpi module under hadoop-mapreduce-project. Once we have this, we will have a platform (yarn) and two frameworks (MR and MPI) using it. Then we can move yarn out of mapreduce-project, and mapreduce and MPI can move into a frameworks aggregation module (the app-store, gawd, I can't believe I just said that.)

        Sharad Agarwal added a comment -

        hadoop-openmpi-client makes most sense (however, it also contains an app master.)

        hadoop-mapreduce-client is a misnomer. It is not a client. I think we should fix it to be hadoop-mapreduce instead. The actual client is hadoop-mapreduce-client-jobclient, which should be named hadoop-mapreduce-jobclient. If it makes sense, then we should do it separately.

        hadoop-openmpi looks better to me.

        Arun C Murthy added a comment -

        Milind, can you use ApplicationSubmissionContext.tokens or ContainerLaunchContext.serviceData?

        Milind Bhandarkar added a comment -

        @Arun No, I don't need the whole node. In fact, looking more into the orted code, multiple mpi daemons can be made to exist on the same node. So, no issue there.

        Re: security, mentioned above, is the following acceptable?

        1. The MPI App Master generates a random passphrase, and writes it into nodes.lst (which is mode 600).
        2. The job client downloads nodes.lst to a local directory (and preserves permissions).
        3. mpirun reads nodes.lst and supplies this passphrase to orted when asking it to fork the executable.
        4. If the passphrase matches, the user-specified executable runs; otherwise it is an error.
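The four steps above amount to a shared-secret handshake over the filesystem. A minimal sketch, with no real orted involved; the function names and file layout here are assumptions for illustration only:

```python
import hmac
import os
import secrets
import tempfile

def write_nodes_lst(path, hosts):
    """Step 1: the AM generates a random passphrase and writes it as the
    first line of nodes.lst, readable only by the owner (mode 600)."""
    passphrase = secrets.token_hex(16)
    with open(path, "w") as f:
        f.write(passphrase + "\n")
        for h in hosts:
            f.write(h + "\n")
    os.chmod(path, 0o600)
    return passphrase

def check_passphrase(path, supplied):
    """Step 4: the daemon side compares the supplied passphrase against
    the stored one before forking the user executable."""
    with open(path) as f:
        expected = f.readline().strip()
    return hmac.compare_digest(expected, supplied)

# Demo: AM writes nodes.lst into a temp dir, then verification runs.
tmpdir = tempfile.mkdtemp()
nodes_path = os.path.join(tmpdir, "nodes.lst")
secret = write_nodes_lst(nodes_path, ["node1", "node2"])
ok = check_passphrase(nodes_path, secret)
bad = check_passphrase(nodes_path, "wrong-guess")
mode = os.stat(nodes_path).st_mode & 0o777
```

Using hmac.compare_digest avoids a timing side channel on the comparison; the mode-600 file plays the role of the access-controlled channel between AM and job client.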

        Arun C Murthy added a comment -

        I need at most one MPI container per physical node (until I figure out how to avoid port conflicts etc).

        Do you need one container per node or the whole node?

        I suspect you want the whole node to the 'virtual' cluster, if so just ask for all resources (RAM) on that node.

        If you need one container per node then ask for 1 container on a node and only 1 on the rack - of course you can reject containers on 'known' nodes too.
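The "reject containers on 'known' nodes" fallback can be modeled independently of the YARN API: keep the first container offered on each host and hand everything else back. The helper below is a hypothetical illustration of that filtering, not ResourceManager code.

```python
def pick_one_per_host(offered):
    """Accept at most one container per physical node; the rest are
    marked for release back to the resource manager.

    offered: list of (container_id, host) pairs, in arrival order.
    Returns (accepted, released_container_ids)."""
    accepted, released, seen = [], [], set()
    for container_id, host in offered:
        if host in seen:
            released.append(container_id)   # already have this node
        else:
            seen.add(host)
            accepted.append((container_id, host))
    return accepted, released

# Demo: duplicate offers on nodeA and nodeB get rejected.
offers = [(1, "nodeA"), (2, "nodeA"), (3, "nodeB"), (4, "nodeC"), (5, "nodeB")]
accepted, released = pick_one_per_host(offers)
```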

        Milind Bhandarkar added a comment -

        Just realized that if I make nodes.lst permissions 600, no other user will be able to accidentally submit jobs to the virtual MPI cluster (but malicious users can check the RM UI to see MPI AMs, and recreate nodes.lst.)

        Milind Bhandarkar added a comment -

        @Arun, hadoop-openmpi-client makes most sense (however, it also contains an app master.)

        Milind Bhandarkar added a comment -

        @Arun I will try my best to get the first version into 0.23.0 (but as noted above there will be a huge security hole.)

        Milind Bhandarkar added a comment -

        The design is deliberately kept simple.

        One script, "start-mpi -np <numnodes> -out hdfs://user/milind/nodes.lst", starts the application master, which requests <numnodes> containers from the resource manager and waits till all those containers become available. The "job client" polls for the application master to write a file called nodes.lst at the specified location on HDFS.

        As containers become available, the application master spawns the OpenMPI runtime environment daemon (orted) in each of those containers.

        When the job client notices that nodes.lst is available on HDFS, it downloads it to the local directory and exits.

        MPI jobs are then launched with the regular:

        mpirun -np <numnodes> -nodes <nodes.lst> executable

        Multiple MPI jobs can be launched in the same virtual MPI cluster created by start-mpi script.

        After all MPI jobs are done, the cluster is dismantled with

        stop-mpi <nodes.lst>

        (first line of nodes.lst contains application master location and port.)

        Currently, there is no authentication for MPI job submission on the cluster started by the user. Thus, anyone can submit MPI jobs to any virtual MPI cluster. (I promise to do it in the next version.)

        Also, if any of the containers (running orted) exits abnormally, the entire virtual MPI cluster is terminated. (This limitation will be removed in the next version.)

        There is one issue I am currently facing. I need at most one MPI container per physical node (until I figure out how to avoid port conflicts etc.). Any input regarding how to achieve that is welcome. My code walkthrough of the resource manager did not suggest anything obvious.
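The start-mpi / job-client handshake above hinges on polling for nodes.lst. Here is a toy version of that polling loop, with the local filesystem standing in for HDFS; the poll_for_nodes_lst helper and the file layout (AM host:port on the first line, node names after it) follow the description in this comment but are otherwise an invented sketch.

```python
import os
import tempfile
import time

def poll_for_nodes_lst(path, timeout=10.0, interval=0.1):
    """Job-client side: wait until the AM has written nodes.lst, then
    read it. Returns (am_address, node_list). Local filesystem stands
    in for HDFS here."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if os.path.exists(path):
            with open(path) as f:
                lines = f.read().splitlines()
            # First line carries the AM's location and port (per the
            # design); the remaining lines are the allocated nodes.
            return lines[0], lines[1:]
        time.sleep(interval)
    raise TimeoutError("AM never published nodes.lst")

# Demo: pretend the AM already wrote the file before the client polls.
tmp = os.path.join(tempfile.mkdtemp(), "nodes.lst")
with open(tmp, "w") as f:
    f.write("amhost:5000\nnode1\nnode2\n")
am_addr, nodes = poll_for_nodes_lst(tmp)
```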

        Arun C Murthy added a comment -

        where should I place it in the source hierarchy?

        One option is to place it alongside hadoop-mapreduce-client i.e. call it hadoop-openmpi-client. Other ideas?

        Arun C Murthy added a comment -

        Milind, I'm more than happy to ship this in 0.23.0 if possible - else it can go in 0.23.1 if necessary. Thanks for starting work on this!

        Milind Bhandarkar added a comment -

        where should I place it in the source hierarchy? Also, I am currently working off the trunk. In case I get busy with other stuff, I do not want it to be a blocker for 0.23.0. What's the timeline for the 0.23.0 release? I know that I won't be able to make it work on Windows in the first version. I hope that does not become a blocker, too.


          People

          • Assignee:
            Unassigned
            Reporter:
            Milind Bhandarkar
          • Votes:
            6 Vote for this issue
            Watchers:
            77 Start watching this issue


              Time Tracking

              Estimated: 336h
              Remaining: 336h
              Logged: Not Specified
