Details

    • Type: Umbrella
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 0.6.0
    • Fix Version/s: None
    • Component/s: None
    • Labels: None

      Description

      We should take a look at how to integrate Hama's BSP engine with Hadoop's next-generation application platform (YARN).
      It can currently be found in the 0.23 branch.

      1. YARN_trunk_integration.patch
        101 kB
        Thomas Jungblut
      2. YARN_trunk_integration_v3.patch
        107 kB
        Thomas Jungblut
      3. YARN_trunk_integration_v2.patch
        102 kB
        Thomas Jungblut
      4. WelcomeOnYarn.png
        912 kB
        Thomas Jungblut
      5. task_state.dot
        0.1 kB
        ChiaHung Lin
      6. task_phase.dot
        0.1 kB
        ChiaHung Lin
      7. job_state.dot
        0.1 kB
        ChiaHung Lin

        Issue Links

          Activity

          Thomas Jungblut added a comment -

          So let's do this

          Thomas Jungblut added a comment -

          trunk url: http://svn.apache.org/repos/asf/hadoop/common/branches/branch-0.23
          Thomas Jungblut added a comment -

          Let's see how they integrate MPI:

          https://issues.apache.org/jira/browse/MAPREDUCE-2911

          Vinod Kumar Vavilapalli added a comment -

          Excellent! Happy to help in any way possible.

          Let's see how they integrate MPI:

          No need, this can be done independently. If you can write up an initial summary of any design you may have in mind, we can take it forward together. I'll read up on the execution flow of HAMA myself in the meantime.

          A couple of pointers that can help:

          • Implementation of MapReduce itself over YARN: http://svn.apache.org/repos/asf/hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/hadoop-mapreduce-client/
          • Sharad Agarwal's presentation at a HUG on writing a custom ApplicationMaster: hadoop_contributors_meet_07_01_2011.pdf
          • You might also want to follow MAPREDUCE-2719 and MAPREDUCE-2720.
          Thomas Jungblut added a comment -

          Thank you very much. I'll take a look at it.

          Thomas Jungblut added a comment -

          Sorry for my late feedback and thank you for your help and information.

          Currently our codebase carries over most of the "old" Hadoop architecture. We changed parts of the computation model, but task execution and the job lifecycle stay the same as in the "old" Hadoop architecture. We put a synchronization service on top of it which works (most of the time it does not work) with ZooKeeper. In addition we have RPC connections between the servers so they can message each other.

          I suggest implementing our BSPMaster as an "ApplicationMaster". It must take care of allocating new containers, which then become "Grooms" in our namespace. Each groom needs a ZNode and some kind of identifier.

          But there is a question of security on my mind. Do you mind if we don't care about security in the first version? I'm not an expert in authentication systems like Kerberos.

          So everything is actually implemented in some way, but we need to port this code to YARN. I have a lot of time tomorrow, so I'll just start. I also think we are going to split this task up into several smaller pieces, so our other developers can contribute to it, too.

          But I have a more general question:
          Should we make this task part of our framework? Like another Maven module which can be plugged into Hadoop?

          Edward J. Yoon added a comment -

          But there is a question of security on my mind. Do you mind if we don't care about security in the first version? I'm not an expert in authentication systems like Kerberos.

          You can just leave a TODO. Maybe I can help with that part a bit.

          Should we make this task part of our framework? Like another Maven module which can be plugged into Hadoop?

          Creating a new package in 'hama-core' would be best, I think. The biggest advantages are that all the code is in one place, easier administration, and easier maintenance.

          Thomas Jungblut added a comment - edited

          Great, so let's start with some kind of tutorial:

          svn checkout http://svn.apache.org/repos/asf/hadoop/common/branches/branch-0.23/
          

          Follow the building rules in "BUILDING.txt".
          Most of the time you'd just need to run:

          mvn compile -e -DskipTests
          

          This will retrieve the dependencies.

          If you don't have protobuf on your PATH, the build fails while compiling yarn-api. This is caused by the exec plugin, which compiles the protobuf files (generates sources).

          [ERROR] Failed to execute goal org.codehaus.mojo:exec-maven-plugin:1.2:exec (generate-sources) on project hadoop-yarn-api: Command execution failed. Process exited with an error: 1(Exit value: 1) -> [Help 1]
          org.apache.maven.lifecycle.LifecycleExecutionException: Failed to execute goal org.codehaus.mojo:exec-maven-plugin:1.2:exec (generate-sources) on project hadoop-yarn-api: Command execution failed.
          	at org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:217)
          
          

          This pom.xml tries to run an executable called "protoc", so make sure you have installed protobuf correctly.
          You can download it here: http://code.google.com/p/protobuf/

          Follow the steps in INSTALL.txt.
          ./configure
          make
          make install

          "configure" may fail if you don't have g++ installed; just install it via "apt-get install g++" and start the whole process again.
          Pay attention to the output of "make install". For me it said that the shared objects were placed in "/usr/local/lib". You then have to edit "/etc/ld.so.conf", add the path to the protobuf shared objects, and reload with "ldconfig".
          Now you can try running "protoc" in your shell; it should tell you that an input file is missing.

          Back in the YARN tree, you can just call

          mvn clean install -e -rf :hadoop-yarn-api -DskipTests
          

          to rerun the build process.

          Mahadev konar added a comment -

          @Thomas, this might be of help: http://wiki.apache.org/hadoop/DevelopingOnTrunkAfter279Merge
          Thomas Jungblut added a comment - edited

          Great, thanks for the link.
          But as you can see, I figured it out for myself in the meantime.

          Okay, so our first subtask is security.

          Thomas Jungblut added a comment -

          I set up a code project for that.
          http://code.google.com/p/hama-mapreduce-integration/

          Thomas Jungblut added a comment -

          Checked in the first version of the BSPAppMaster and the job stuff (events/implementation etc.).

          We need to rethink our states and transitions, but I think we can basically cut them down to Setup->Compute->Cleanup.
          A more sophisticated way would be to split the compute state into the supersteps, so that each superstep gets handled by a state transition. That could introduce a lot of overhead, but it seems cleaner than the handling through ZooKeeper; e.g. we can wait until every task has reached the transition with a cyclic barrier, just like in the LocalBSPRunner.
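          The cyclic-barrier superstep idea mentioned above can be sketched as a toy in-process model, the way the LocalBSPRunner synchronizes peers. This is an illustration only; the class and method names are hypothetical, not actual Hama code:

```java
import java.util.concurrent.BrokenBarrierException;
import java.util.concurrent.CyclicBarrier;
import java.util.concurrent.atomic.AtomicInteger;

// Toy model: N peers run S supersteps; a CyclicBarrier is the sync point
// between supersteps, so no peer enters superstep s+1 before all finish s.
public class SuperstepBarrierDemo {
  public static int run(int peers, int supersteps) throws Exception {
    CyclicBarrier barrier = new CyclicBarrier(peers);
    AtomicInteger work = new AtomicInteger();
    Thread[] threads = new Thread[peers];
    for (int p = 0; p < peers; p++) {
      threads[p] = new Thread(() -> {
        try {
          for (int s = 0; s < supersteps; s++) {
            work.incrementAndGet(); // "compute" phase of this superstep
            barrier.await();        // barrier sync: wait for all peers
          }
        } catch (InterruptedException | BrokenBarrierException e) {
          throw new RuntimeException(e);
        }
      });
      threads[p].start();
    }
    for (Thread t : threads) t.join();
    return work.get(); // peers * supersteps units of work completed
  }

  public static void main(String[] args) throws Exception {
    System.out.println(run(4, 3)); // prints 12
  }
}
```

          In a distributed setting the barrier would of course have to live in an external service (ZooKeeper or the ApplicationMaster) rather than in a shared in-process object.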

          And we need to write the event dispatchers for them.
          My next goal is to rewrite the BSPJobImpl class for our needs, e.g. remove the map and reduce task counters and extract an interface.

          Thomas Jungblut added a comment -

          Today and yesterday I did a bit of work on tasks, scheduling, and lifecycle management.

          TODO:
          • Container things and how to get the BSPs running in reality
          • sync service with ZooKeeper

          Vinod Kumar Vavilapalli added a comment -

          Long comment, the leisure of the weekend

          Good to see the ball rolling.

          I had a browsing session on the current HAMA code (let's call this the HamaV1 code) and the mapreduce-integration branch (actually this should be Yarn-integration, so let's call this HamaV2).

          Some thoughts follow. Some of the following may be naive as I am new around here

          Regarding the Job and Task state machines: Yes, it does look like you don't need a lot of states and their corresponding transitions here, from what I can see from HamaV1's JobInProgress and TaskInProgress. Is that because you don't have good failure handling in HamaV1 (as I read in one of the presentations)? If that isn't true, ignore what follows. Otherwise, I think it is the right time to think about fault tolerance (if at all) and write down the state machines to include the faulty scenarios.

          Implementation of barrier synchronization: Not sure of the problems you ran with ZooKeeper in HamaV1, but can't we use the ApplicationMaster(AM) in HamaV2 as a barrier synchronization service? Each BSPPeer could periodically poll the AM if it can proceed to the next superstep. If and when the AM goes down, all the BSPPeers just wait there spinning till AM is restarted by the Yarn ResourceManager.
          – Pros: Avoiding ZooKeeper frees BSP from the external ZK dependency, one less service needed for running HAMA apps.
          – Cons: It robs HAMA of the notification push via ZK's watcher mechanism (notification push vs. periodic pull). (This should be agreeable, no?)
          Thoughts?
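          A minimal in-process sketch of the AM-side barrier proposed above: peers report that they reached a superstep, then poll until every peer has reported. All names here are hypothetical (the real service would sit behind an RPC interface on the ApplicationMaster):

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

// Toy model of the proposed AM-hosted barrier service. Each peer calls
// reachedSuperstep(s) once, then periodically polls canProceed(s).
public class PollingBarrier {
  private final int numPeers;
  private final ConcurrentHashMap<Integer, AtomicInteger> arrived =
      new ConcurrentHashMap<>();

  public PollingBarrier(int numPeers) {
    this.numPeers = numPeers;
  }

  // Called once by each peer when it finishes superstep s.
  public void reachedSuperstep(int s) {
    arrived.computeIfAbsent(s, k -> new AtomicInteger()).incrementAndGet();
  }

  // Polled periodically by peers; true once all peers reached superstep s.
  public boolean canProceed(int s) {
    AtomicInteger count = arrived.get(s);
    return count != null && count.get() >= numPeers;
  }
}
```

          If the AM goes down, the per-superstep counts would have to be rebuilt on restart (e.g. peers re-report their current superstep), which is the restart-and-spin behavior described above.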

          Regarding use of MR classes:

          • Reuse of MRV2 classes: I was appalled by the amount of Hadoop MapReduce code (kinda) forked in HamaV1. Glad that with Yarn and HamaV2, most of the forking will be gone. Still, one look at the HamaV2 code you have at Google Code tells me you are trying to mimic MRV2 (MapReduce over YARN) internals. IMO, that isn't needed, as Job, Task, TaskAttempt etc. in MR carry concepts specific to MapReduce, like map/reduce tasks. I think we can redesign the objects needed for HAMA here with far more ease. And that's cleaner too.
          • Code reuse from MRV2: OTOH, I do clearly see that we should reuse MRV2 components like ContainerLauncher (launches containers on nodes) and RMContainerAllocator (requests containers from the ResourceManager). I'll see how we can move these into a separate common library module out of MRV2 so that Hama (and possibly others) can use them.

          Meta comment: Instead of jumping into writing the implementation, I think it helps to spend some time developing the design until it reaches some level of stability, then writing down the module structure (like a BspAppMaster module, a BspChild module, etc.), followed by the interfaces of all the data objects and components, and finally wiring them together. Once we have all the interfaces and communication patterns in place, implementation can be done in parallel. It helped us write MRV2 a lot cleaner; I'm sure it will help us here too.

          General infra thought: I think having this branch at the Apache svn helps HAMA's incubation status. It will also be easy for anyone else from the current hama-dev list interested in working on this to use Apache lists, svn etc. (Oh, BTW, I am looking to collaborate too.) What do you think?

          Thomas Jungblut added a comment -

          Wow, that's a wall of text.

          I'm no contributor (yet?), so I don't have SVN access; that was the main reason I chose the Google Code repo.
          Yes, we took a lot of Hadoop's old code for HamaV1. These days we don't have failure recovery; detection should be on its way (HAMA-370).

          Fault tolerance in HamaV2 should basically just check whether a container is available through some kind of heartbeat. If a task isn't responding, we should roll back to the state it was in before. The task is responsible for saving its state every superstep, e.g. the messages received from other peers. This should be stored in HDFS along with the task-id so the AM can rerun the task with this input. -> We need some kind of task attempts.
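          The checkpoint-per-superstep idea above could look roughly like the following sketch, where a plain map stands in for HDFS and every name is hypothetical (not actual HamaV2 code):

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy sketch of per-superstep checkpointing: after each superstep a task
// saves the messages it received under (taskId, superstep); a restarted
// attempt resumes from the latest checkpoint. A Map stands in for HDFS.
public class CheckpointStore {
  private final Map<String, Map<Integer, List<String>>> store = new HashMap<>();

  // Called by a task at the end of superstep `superstep`.
  public void save(String taskId, int superstep, List<String> messages) {
    store.computeIfAbsent(taskId, k -> new HashMap<>())
         .put(superstep, new ArrayList<>(messages));
  }

  // Latest superstep checkpointed for this task, or -1 if none exists,
  // i.e. a fresh attempt must start from the beginning.
  public int latestSuperstep(String taskId) {
    Map<Integer, List<String>> checkpoints = store.get(taskId);
    if (checkpoints == null || checkpoints.isEmpty()) return -1;
    return checkpoints.keySet().stream().max(Integer::compare).get();
  }

  // Messages to replay into the restarted attempt.
  public List<String> restore(String taskId, int superstep) {
    return store.get(taskId).get(superstep);
  }
}
```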

          Implementation of barrier synchronization:
          I would be very glad if we could get away from ZooKeeper's sync service; we had a lot of ideas about how to make it work (see HAMA-387), but it didn't help. Edward asked a question on their user list, but they offered just the same ideas we had tried out before.

          This should be agreeable, no?

          Polling is totally agreeable. I very much doubt that ZooKeeper isn't doing polling internally itself.

          Reuse of MRV2 classes

          As you might see, I am reusing your classes heavily. That's cool, but it is more work to cut down your state-machine handling to something simpler than to rewrite it from scratch.

          I do clearly see that we should reuse MRV2 components like ContainerLauncher (launches containers on nodes) and RMContainerAllocator (requests containers from the ResourceManager). I'll see how we can move these into a separate common library module out of MRV2 so that Hama (and possibly others) can use them.

          +1, that would be great.

          Instead of jumping into writing the implementation, I think it helps to spend some time developing the design till it reaches some level of stability and then writing down the module structure [...]

          You are right.

          ChiaHung Lin added a comment -

          Some thoughts based on what I know so far, but I may be wrong (or miss the point) and probably do not see the whole forest.

          Each BSPPeer could periodically poll the AM if it can proceed to the next superstep. ...

          With polling, it seems there is a chance the polling would never reach agreement (there could always be one process missing) in an unfortunate timing case. Also, as the number of processes increases, it would probably increase the load on the master from dealing with polling tasks.

          In addition, my understanding is that the integration with MRV2 would just be additional support, so that an MR job/application can be submitted without rewriting it to use Hama for computation.

          Thomas Jungblut added a comment -

          With polling, it seems there is a chance the polling would never reach agreement (there could always be one process missing) in an unfortunate timing case. Also, as the number of processes increases, it would probably increase the load on the master from dealing with polling tasks.

          This is correct, but it depends highly on the polling interval. As far as I understand, each BSPJob gets its own ApplicationMaster, so there is no "master machine" anymore like our BSPMaster or the JobTracker.

          We have two options:

          • fix the barrier sync with ZooKeeper and use it in the AM and peers
          • do the polling in the ApplicationMaster

          In addition, my understanding is that the integration with MRV2 would just be additional support, so that an MR job/application can be submitted without rewriting it to use Hama for computation.

          That is right as well. I think we should make this a configuration-based decision: whether YARN (or the URL) has been set or not.

          Vinod Kumar Vavilapalli added a comment -

          Thanks Thomas for your replies.

          I'm no contributor (yet?), so I don't have SVN access; that was the main reason I chose the Google Code repo.

          We can work with patches, but that won't scale. I think we need to get commit privileges with a promise to restrict ourselves to a branch. I noticed Edward is off for a week; maybe he can pull some strings when he's back?

          Fault tolerance in HamaV2 should basically ..

          If this isn't already there in V1, it makes sense to take this up as a follow-up to the first cut of V2.

          As you might see, I am reusing your classes heavily. That's cool, but it is more work to cut down your state-machine handling to something simpler than to rewrite it from scratch.

          Yes, I propose that we start afresh. As you mentioned, it is a lot less work than trying to cut down the state machine and peel off the MR-specific stuff.

          Vinod Kumar Vavilapalli added a comment -

          ChiaHung,

          With polling, it seems there is a chance the polling would never reach agreement (there could always be one process missing) in an unfortunate timing case. Also, as the number of processes increases, it would probably increase the load on the master from dealing with polling tasks.

          Regarding the missing processes, which we call stragglers in MapReduce: isn't the API such that there should be no progress until all the processes perform the barrier sync?
          Regarding the load: even the MR AM, which uses a Hadoop RPC server, has similar requirements, on the order of tens of thousands of tasks. That amount of scalability should be enough for Hama's case too. And like Thomas mentioned, each BSPMaster only needs to serve its own job's BSPPeers, so that should help too.

          In addition, my understanding is that the integration with MRV2 would just be additional support, so that an MR job/application can be submitted without rewriting it to use Hama for computation.

          It is not clear to me. But if you are talking about the ability to run the current BSP jobs without rewriting them, then yes, we will support API-level compatibility.

          Thomas Jungblut added a comment -

          Thanks Vinod,

          We can work with patches, but that won't scale. I think we need to get commit privileges with a promise to restrict ourselves to a branch. I noticed Edward is off for a week, maybe he can pull some strings when he's back?

          He pulled some strings yesterday. My account is on its way. I guess we can start in 1-2 days.

          If this isn't already there in V1, it makes sense to take this up as a follow up to the first cut of V2.

          Yes, but I think we should take things like TaskAttempts into account.
          Rolling back an attempt will be another task and should be scheduled for V2. Extending the state machine to handle these events will be another task, too.

          So our first version will solely be:

          • Message passing between peers
          • Barrier Sync with control of the ApplicationMaster
          • Job submission via the current BSPJob class.

          And the statemachine will just be:
          Setup->Computation->Cleanup

          Anything else we should take into account?
          @ChiaHung do you want to help us?
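
          The Setup->Computation->Cleanup flow above could be sketched as a trivial state machine; a minimal sketch, assuming illustrative class and state names (this is not the actual Hama API):

```java
// Hypothetical sketch of the minimal job state machine discussed above:
// Setup -> Computation -> Cleanup -> Finished. Names are illustrative only.
public class JobStateMachine {
    public enum State { SETUP, COMPUTATION, CLEANUP, FINISHED }

    private State state = State.SETUP;

    // Advance to the next state; throws if the job is already finished.
    public State advance() {
        switch (state) {
            case SETUP:       state = State.COMPUTATION; break;
            case COMPUTATION: state = State.CLEANUP;     break;
            case CLEANUP:     state = State.FINISHED;    break;
            default: throw new IllegalStateException("Job already finished");
        }
        return state;
    }

    public State getState() { return state; }

    public static void main(String[] args) {
        JobStateMachine job = new JobStateMachine();
        System.out.println(job.getState());  // initial state: SETUP
        job.advance();                       // -> COMPUTATION
        job.advance();                       // -> CLEANUP
        System.out.println(job.advance());   // -> FINISHED
    }
}
```

          Fault states, as discussed, could later be added as extra enum values and transitions without changing the happy path.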

          ChiaHung Lin added a comment -

          Vinod,

          Regarding the missing processes, which we call stragglers in mapreduce, isn't the API such that there should be no progress till all the processes perform the barrier sync?

          Yes, in that case there would be no progress. However, it differs from barrier sync with ZooKeeper in that different stragglers may fail to poll in each round due to network load, etc. For instance, with a polling interval of e.g. 1 second, each GroomServer polls to check whether it can proceed; due to network congestion, the master server may keep receiving only part of the responses (not responses from all GroomServers). So the rate of barrier syncs making no progress could be higher than expected. Alternatively, we could have the master help coordinate between stragglers, but that seems to be the kind of task that should be handled by a ZooKeeper service. In addition, if there are going to be multiple masters, replicating the poll information must also be taken into account.

          I was just thinking of some issues we may need to consider beforehand if we decide to work in this direction. Thanks Vinod, that inspires me a lot.

          Thomas Jungblut added a comment -

          Regarding the barrier sync:
          My idea is that we have two RPC calls, enterBarrier() and leaveBarrier().
          In the ApplicationMaster we can handle each superstep via a CyclicBarrier[1] on the number of tasks.
          So the RPC call goes from the client container to the ApplicationMaster; there it blocks on the barrier, causing the clients to wait until the barrier is tripped. Then the RPC call returns to the clients, they send their messages, and the whole thing is repeated with leaveBarrier().

          In this case we don't have to poll for completion, and it is a higher-level construct that works (see LocalBSPRunner).

          [1] http://download.oracle.com/javase/1.5.0/docs/api/java/util/concurrent/CyclicBarrier.html
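
          A minimal, runnable sketch of this enterBarrier()/leaveBarrier() idea, with plain threads standing in for the RPC calls from the task containers (all names are illustrative, not the actual implementation):

```java
import java.util.concurrent.BrokenBarrierException;
import java.util.concurrent.CyclicBarrier;

// Sketch of a double-barrier superstep sync: enterBarrier() blocks until all
// tasks have arrived, then messages are exchanged, then leaveBarrier() blocks
// until everyone is done sending. CyclicBarrier is reusable, so the same pair
// of barriers works for every superstep.
public class BarrierSyncSketch {
    private final CyclicBarrier enter;
    private final CyclicBarrier leave;

    public BarrierSyncSketch(int numTasks) {
        this.enter = new CyclicBarrier(numTasks);
        this.leave = new CyclicBarrier(numTasks);
    }

    // In the real design this would be an RPC call into the ApplicationMaster.
    public void enterBarrier() throws InterruptedException, BrokenBarrierException {
        enter.await();
    }

    public void leaveBarrier() throws InterruptedException, BrokenBarrierException {
        leave.await();
    }

    public static void main(String[] args) throws Exception {
        final int numTasks = 4;
        final BarrierSyncSketch sync = new BarrierSyncSketch(numTasks);
        Thread[] peers = new Thread[numTasks];
        for (int i = 0; i < numTasks; i++) {
            final int id = i;
            peers[i] = new Thread(() -> {
                try {
                    sync.enterBarrier();   // wait until all peers entered
                    // ... a real BSPPeer would send its messages here ...
                    sync.leaveBarrier();   // wait until all peers finished sending
                    System.out.println("peer " + id + " finished superstep");
                } catch (Exception e) {
                    throw new RuntimeException(e);
                }
            });
            peers[i].start();
        }
        for (Thread t : peers) t.join();
        System.out.println("superstep complete");
    }
}
```

          No polling is involved: each peer simply blocks inside the barrier call until it is tripped by the last arriving peer.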

          Thomas Jungblut added a comment -

          And it works:
          https://github.com/thomasjungblut/barriersync/

          So no need for polling or zookeeper.

          Thomas Jungblut added a comment -

          What do we need then?

          • I propose to use my barrier sync code, so we don't introduce a dependency on ZooKeeper.
          • ApplicationMaster
          • BSPJob that has the ultimate state machine
          Thomas Jungblut added a comment -

          Okay Vinod, I need your professional help.

          I started playing around with the ApplicationMaster and I think we need a ContainerLauncher, right?
          The deal with the ContainerLauncher is that you need an AppContext to launch it. Unfortunately the AppContext needs an M/R JobID and M/R Job for its methods.
          Is there a chance to refactor this? Or should we just import this but not use them?

          ChiaHung Lin added a comment -

          Attached files are roughly sketched state graphs (via Graphviz). The task state is a bit unclear to me, so there may be something missing. Please help correct the diagrams (or anything related); I can then update them with a fault scenario.

          Thomas Jungblut added a comment -

          Thanks for the state machine. Looks good.
          A fault scenario would be great, too.
          I would create the state machine for the fault cases in the first implementation, but with no action triggered, so we can easily add this later.

          Let's assemble what daemons we have to launch from the application master:

          • n BSP Tasks
          • a checkpointer per host? [1]
          • the sync daemon [2]

          [1] I see a real problem here; maybe we have to integrate this into a running task / BSPPeer.
          [2] Based on where YARN schedules the container, it might check for a free port. How do we get the hostname:port of this machine then?

          Vinod Kumar Vavilapalli added a comment -

          I started playing around with the ApplicationMaster and I think we need a ContainerLauncher right?

          Don't import or copy over the ContainerLauncher. All you need is the code in there to start and stop containers.

          Regarding the state machines, we will need some kind of representation for the barrier sync in the job too.

          a checkpointer per host? the sync daemon

          What is the purpose of these two?

          I'll be hanging around at #hama channel at freenode, we can sync up w.r.t implementation details. (My timezone is IST)

          Vinod Kumar Vavilapalli added a comment -

          Okay, looked at your code on github. Seems like Sync daemon can be started by the ApplicationMaster itself.

          Still not sure about the checkpointer.

          Thomas Jungblut added a comment -

          Don't import or copy over the ContainerLauncher. All you need is the code in there to start and stop containers.

          Oh okay. I'll remove them later. Can you provide a tiny code example of what is needed to launch a container and how it should look in our case?

          Regarding the state machines, we will need some kind of representation for the barrier sync in the job too.

          I would not track this via the state machine. But it could be possible if we integrate the sync daemon into the ApplicationMaster.
          So the ApplicationMaster would take care of the sync. Wouldn't the RPC services block each other then? I tested an integration with our Groom and BSPMaster, and it totally failed, so I had to put this into another process.

          Seems like Sync daemon can be started by the ApplicationMaster itself.

          Yes, there is still the question of how to get the host:port of this daemon after it has been launched. Is there a kind of communication between the starter and the container?

          Still not sure about the checkpointer.

          Me neither, so we can leave this point open and revise the checkpointing mechanism later. I don't want to be inconsistent with the current features, but I think each task has to do its own checkpointing.

          I'll be hanging around at #hama channel at freenode, we can sync up w.r.t implementation details. (My timezone is IST)

          I was off for a few days, so I wasn't there; you probably noticed. I'm sorry.

          ChiaHung Lin added a comment -

          Yes, checkpointing at the moment saves data to HDFS per host. The primary reason for having a separate checkpointing process is to ensure the BSP task keeps running even in the presence of a checkpointing-service failure. Although we could combine the checkpointing process with the BSP task, chances are that if the checkpointing process fails, the failure may propagate to the BSP task, resulting in its collapse. I think Joe Armstrong's paper [1] explains this well.

          [1] Making reliable distributed systems in the presence of software errors. http://www.sics.se/~joe/thesis/armstrong_thesis_2003.pdf

          Thomas Jungblut added a comment -

          Yes, that is correct.
          But I don't see the improvement: if a task fails, the checkpointer within the task also fails. If you separate the checkpointer into its own process that guards several tasks, it can take down the tasks it guards if that process misbehaves. Armstrong is just referring to the need for redundancy to absorb failure, but with a single process guarding several tasks you have introduced another point of failure that can have much more impact than a single failing task.

          Each task attempt should write its checkpoints with its taskID, attemptID and superstep (as the name?) into HDFS so it can be restarted from outside.
          That's just my opinion on that, but you're the fault-tolerance professional

          But I would leave this outside for now and we can open another issue that will add this. In this issue we can talk about the benefits of another process.

          Thomas Jungblut added a comment -

          Thanks for the tutorial in the YARN site module. Although it is incomplete and sometimes uses variables that were never declared, it is really helpful.
          I just added the container allocation and start of the sync server as a daemon inside of the application master.
          Then I added the BSPTaskLauncher which will then spawn the BSP tasks.

          I decided not to follow the state-machine handling used in MapReduce, because I think that style of event handling is worse than GOTOs. It is not transparent (maybe not only to me?) from which point the event handler is handling which events.
          I don't think we need this actor model now; we should have a simple first snapshot that works and is easy to develop with.
          BSP does not have the same task execution model as MapReduce, and we don't need the capability to schedule tasks during the computation (except on failure).

          TODO:

          • the real launching of the tasks within the BSPTaskLauncher
          • task/job/overall cleanup
          • the client integration
          • a lot of testcases...
          ChiaHung Lin added a comment -

          Indeed, 3 checkpointer processes were one solution implemented previously. It was just that having 3 processes doing the same thing seemed too redundant, so the implementation was changed, because the first goal is to ensure the BSP task works smoothly. We can discuss this in other threads/issues if needed. And yes, at the moment the checkpointed data is written to HDFS as jobid/supersteps/taskid so that data recovery is possible. : )
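
          The jobid/supersteps/taskid layout mentioned above could look roughly like this; a minimal sketch where the base directory and formatting are purely illustrative, not Hama's actual naming scheme:

```java
// Hypothetical checkpoint path layout, encoding job id, superstep and task id
// into the HDFS path so a failed attempt can be restarted from outside.
// The "/checkpoint" base and the zero-padding are illustrative assumptions.
public class CheckpointPath {
    public static String checkpointPath(String jobId, long superstep, int taskId) {
        return String.format("/checkpoint/%s/superstep_%d/task_%06d",
                jobId, superstep, taskId);
    }

    public static void main(String[] args) {
        System.out.println(checkpointPath("job_201110110012", 42, 3));
    }
}
```

          Keeping the superstep in the path means recovery can pick the latest complete superstep and restart every task from there.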

          Thomas Jungblut added a comment -

          Progress update:

          • Cleanup for everything has been implemented.
          • Launching of tasks, too. But it has some flaws, e.g. we should use the container's methods to pass the classpath and jars, as well as the configuration.

          I guess we can split this task into client integration and checkpointing integration. Any opinions?

          Thomas Jungblut added a comment -

          Just scripted the client, so we have a first complete snapshot.
          I'd like to run this now, but I have trouble assembling the Hadoop packages together and running YARN.
          Is there a complete pre-release tarball? mvn assembly:assembly does not work on Hadoop Main.

          The best part is that this module is now quite independent of the Hama core; I just use the interfaces and the taskid/jobid classes.
          So we can revisit the effort of splitting the modules... if we want to.

          Thomas Jungblut added a comment -

          Just a quick note:
          I would be glad if someone would review what I've written.

          Edward J. Yoon added a comment -

          I will!

          Thomas Jungblut added a comment -

          I think an abstract BSPPeer class which summarizes the methods we have in common could be worth the work.
          I should refactor this.

          Thomas Jungblut added a comment -

          I have a problem launching the APP master.

          I'm putting the classpath together via suggested code:

              // Assuming our classes or jars are available as local resources in the
              // working directory from which the command will be run, we need to append
              // "." to the path.
              // By default, all the hadoop specific classpaths will already be available
              // in $CLASSPATH, so we should be careful not to overwrite it.
              String classPathEnv = "$CLASSPATH:./*:";
              env.put("CLASSPATH", classPathEnv);
              amContainer.setEnvironment(env);
          

          The app master gets started with following start command:

          Start command: ${JAVA_HOME}/bin/java -cp $CLASSPATH:./*: org.apache.hama.bsp.BSPApplicationMaster file:/home/thomasjungblut/Desktop/application_1318323647317_0012/job.xml 1><LOG_DIR>/stdout 2><LOG_DIR>/stderr
          

          Which is correct, assuming that $CLASSPATH has been set properly by YARN.

          But I think that it is not:

          11/10/11 11:37:03 INFO ipc.YarnRPC: Creating YarnRPC for org.apache.hadoop.yarn.ipc.HadoopYarnProtoRPC
          Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/avro/ipc/Server
                  at java.lang.Class.forName0(Native Method)
                  at java.lang.Class.forName(Class.java:169)
                  at org.apache.hadoop.yarn.ipc.YarnRPC.create(YarnRPC.java:53)
                  at org.apache.hama.bsp.BSPApplicationMaster.<init>(BSPApplicationMaster.java:104)
                  at org.apache.hama.bsp.BSPApplicationMaster.main(BSPApplicationMaster.java:233)
          Caused by: java.lang.ClassNotFoundException: org.apache.avro.ipc.Server
                  at java.net.URLClassLoader$1.run(URLClassLoader.java:202)
                  at java.security.AccessController.doPrivileged(Native Method)
                  at java.net.URLClassLoader.findClass(URLClassLoader.java:190)
                  at java.lang.ClassLoader.loadClass(ClassLoader.java:307)
                  at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:301)
                  at java.lang.ClassLoader.loadClass(ClassLoader.java:248)
                  ... 5 more
          

          Is there a suggestion on how to solve this?

          FYI, I'm submitting the dummy job via "yarn/bin/yarn jar xyz.jar". The generated start command contains all the jars that should be needed. So classpath should be set properly.

          Thomas Jungblut added a comment -

          Oh sorry, that is the Avro dependency. I was distracted by the YARN class. I'll try with the Avro jar in the CP.

          Thomas Jungblut added a comment -

          Putting Avro into the CP worked fine.
          But I'm facing relatively large issues with our project now.

          In every one of our modules, we have hadoop-0.20.2 as a dependency.
          The SERVER module depends on API and CORE, and hadoop-0.20.2 now ships within every jar export / snapshot build. Old classes override the new ones during the build process.

          Should I change the hadoop-0.20.2 dependency to 0.23.0 in every module to fix this? I see several compatibility issues in our main packages, especially with ZooKeeper.

          I'm still thinking about just making the whole module depend on a Hama-0.4.0-SNAPSHOT and developing further. The integration totally sucks.

          Edward J. Yoon added a comment -

          Looks good to me.

          By the way, you'll use syncServer instead of Zookeeper?

          Thomas Jungblut added a comment -

          Thanks, yes. I don't really know if it works better than Zookeeper.

          Somehow I'm facing classpath issues when starting the containers... I'll give you more information later on.

          Thomas Jungblut added a comment -

          Okay I fixed the last issues.

          I would now open some subtasks, for example to clean up the TODOs or to test with a distributed YARN (currently working with a pseudo-distributed one).
          Profiling the ApplicationMaster would be a task, too. Its container gets killed if less than 2 GB is allocated; I don't think it actually uses that much memory (maybe it's a misconfiguration).
          Then we should take a look at the module splitting. Currently the server package is just on top of API/Core, which in trunk is solely core.
          Although I think it is not a bad design, we have several new classes for the YARN stuff, e.g. the YARNBSPJob. I don't know if we really should integrate YARN into our BSPJob.
          Then we have to catch the sources up to the current trunk.

          BTW you can build the server module with mvn install package and use the shaded jar to run on yarn with: yarn/bin/yarn jar <JAR>.jar org.apache.hama.bsp.YarnSerializePrinting

          It is currently not working as expected, I'm still facing some conf and classpath issues. But I hope I'll finish them today.

          Thomas Jungblut added a comment -

          Hello BSP on YARN

          So for me the first snapshot is done. We can go back to our patch review process again.

          Thomas Jungblut added a comment -

          I'm proposing follow up issues:

          • fix TODOs
          • integrate the SuperSteps in verbose mode of the YARNBSPJob (not working properly somehow)
          • test Hama-YARN in fully distributed mode (check if configurations and jars are getting copied correctly)
          • check examples compatibility with YARN
          • reintegrate to trunk (I need some opinions on that please, split modules etc)
          • integrate checkpointing again
          • run findbugs over it
          • refactor BSPPeer so that we have an abstract version of it which runs on YARN and Hama
          • LOTS of testcases

          That is quite a lot, but they are very small tasks. So this could be done very fast.

          Edward J. Yoon added a comment -

          Thanks, yes. I don't really know if it works better than Zookeeper.

          If there are no big benefits, ZooKeeper is better, I think. Whatever we choose, it should be designed as a common module that provides the sync service, so that we can improve both our own cluster version and the YARN version.

          Thomas Jungblut added a comment -

          Well, like I already told you in the other issue:
          It makes for more readable and clearer code in BSPPeer: 1 line vs. 15 lines of synchronization stuff. And it may be faster, because it has less overhead.
          And there is no additional process to spawn.

          But I don't know whether it has the same issues ZooKeeper has. We'll just have to test this. From the code I've run it works totally fine, but I don't own a 1k-node cluster to test it, so there might be scalability and stability issues.
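          To make the "1 line vs. 15" comparison concrete, here is a toy sketch (my illustration, not Hama's actual SyncServerImpl; class and method names are made up) of what a centralized barrier buys the peer side: a single blocking call, with all the bookkeeping hidden behind it.

```java
import java.util.concurrent.BrokenBarrierException;
import java.util.concurrent.CyclicBarrier;
import java.util.concurrent.atomic.AtomicInteger;

// Toy stand-in for a centralized sync server: every peer makes one
// blocking enterBarrier() call and is released once all peers arrived.
public class BarrierSketch {
  private final CyclicBarrier barrier;

  public BarrierSketch(int numPeers) {
    this.barrier = new CyclicBarrier(numPeers);
  }

  // Peer side of the superstep boundary: one call, no znode handling.
  public void enterBarrier() throws InterruptedException, BrokenBarrierException {
    barrier.await();
  }

  public static void main(String[] args) throws Exception {
    final int peers = 4;
    final BarrierSketch sync = new BarrierSketch(peers);
    final AtomicInteger released = new AtomicInteger();
    Thread[] threads = new Thread[peers];
    for (int i = 0; i < peers; i++) {
      threads[i] = new Thread(() -> {
        try {
          sync.enterBarrier(); // blocks until all 4 peers have arrived
          released.incrementAndGet();
        } catch (InterruptedException | BrokenBarrierException e) {
          throw new RuntimeException(e);
        }
      });
      threads[i].start();
    }
    for (Thread t : threads) {
      t.join();
    }
    System.out.println("released=" + released.get());
  }
}
```

          In the real implementation the barrier would of course sit behind RPC in the sync server rather than in-process, but the peer-side call site stays equally small.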

          Edward J. Yoon added a comment -

          As you know, Zookeeper provides

          • Fault-Tolerance
          • Scalability (1,000+ clients per cell)
          • High Performance
          • Easy to use

          And it's already verified. I would like to suggest that we investigate FT, HA, and split-brain issues further.

          And again, whatever we choose, it should be designed as a common module.

          Thomas Jungblut added a comment -

          You're right.
          Note that we get fault tolerance for the YARN sync because it is part of the app master: it can simply be restarted.
          And "easy to use" is a joke, isn't it?

          This:

          protected boolean enterBarrier() throws KeeperException, InterruptedException {
              if (LOG.isDebugEnabled()) {
                LOG.debug("[" + getPeerName() + "] enter the enterbarrier: "
                    + this.getSuperstepCount());
              }
          
              synchronized (zk) {
                createZnode(bspRoot);
                final String pathToJobIdZnode = bspRoot + "/"
                    + taskid.getJobID().toString();
                createZnode(pathToJobIdZnode);
                final String pathToSuperstepZnode = pathToJobIdZnode + "/"
                    + getSuperstepCount();
                createZnode(pathToSuperstepZnode);
                BarrierWatcher barrierWatcher = new BarrierWatcher();
                Stat readyStat = zk.exists(pathToSuperstepZnode + "/ready",
                    barrierWatcher);
                zk.create(getNodeName(), null, Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL);
          
                List<String> znodes = zk.getChildren(pathToSuperstepZnode, false);
              int size = znodes.size(); // may contain ready
                boolean hasReady = znodes.contains("ready");
                if (hasReady) {
                  size--;
                }
          
                LOG.debug("===> at superstep :" + getSuperstepCount()
                    + " current znode size: " + znodes.size() + " current znodes:"
                    + znodes);
          
                if (LOG.isDebugEnabled())
                  LOG.debug("enterBarrier() znode size within " + pathToSuperstepZnode
                      + " is " + znodes.size() + ". Znodes include " + znodes);
          
                if (size < jobConf.getNumBspTask()) {
                  LOG.info("xxxx 1. At superstep: " + getSuperstepCount()
                      + " which task is waiting? " + taskid.toString()
                      + " stat is null? " + readyStat);
                  while (!barrierWatcher.isComplete()) {
                    if (!hasReady) {
                      synchronized (mutex) {
                        mutex.wait(1000);
                      }
                    }
                  }
                  LOG.debug("xxxx 2. at superstep: " + getSuperstepCount()
                      + " after waiting ..." + taskid.toString());
                } else {
                  LOG.debug("---> at superstep: " + getSuperstepCount()
                      + " task that is creating /ready znode:" + taskid.toString());
                  createEphemeralZnode(pathToSuperstepZnode + "/ready");
                }
              }
              return true;
            }
          

          is simply not an easy way to use ZooKeeper at all. And it does not work correctly without throwing exceptions the whole time.
          Even if you set the logging aside, it is just a concurrency nightmare.

          And again, whatever we chose, it should be designed as a common module.
          

          I suggest making BSPPeer (or BSPPeerImpl, whatever it is called now) an abstract class and subclassing a ZooKeeper sync peer and an RPC sync peer. Let the user decide.
          I think this is just a discussion between ">I< don't like ZooKeeper" and "all the other projects use it". It is not something that will lead us towards a solution anyway.
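          The proposed split could look roughly like this; a minimal sketch with hypothetical names (BSPPeerBase, ZooKeeperSyncPeer), not the committed API:

```java
// Sketch of the proposed abstract peer; all names are hypothetical.
public class AbstractPeerSketch {

  static abstract class BSPPeerBase {
    abstract void enterBarrier() throws Exception;
    abstract void leaveBarrier() throws Exception;
    abstract String[] getAllPeerNames();

    // The superstep boundary itself is the same for every sync backend.
    final void sync() throws Exception {
      enterBarrier();
      leaveBarrier();
    }
  }

  // Stand-in for a ZooKeeper-backed peer; the real one would create and
  // watch znodes. An RpcSyncPeer subclass would talk to the sync server.
  static class ZooKeeperSyncPeer extends BSPPeerBase {
    int barriersEntered;

    @Override
    void enterBarrier() {
      barriersEntered++; // real code: create ephemeral znode, wait for /ready
    }

    @Override
    void leaveBarrier() {
      // real code: delete the ephemeral znode here
    }

    @Override
    String[] getAllPeerNames() {
      return new String[] { "peer0:61000" };
    }
  }

  public static void main(String[] args) throws Exception {
    ZooKeeperSyncPeer peer = new ZooKeeperSyncPeer();
    peer.sync();
    System.out.println("barriersEntered=" + peer.barriersEntered);
  }
}
```

          The BSP driver code would only ever see BSPPeerBase.sync(), so swapping the sync backend becomes a configuration choice rather than a code change.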

          Edward J. Yoon added a comment -

          I like the fact that I can simply add quorum servers. Anyway, the sync service is a completely separate issue from YARN. So, if we design it as a common module, we can improve it later if needed.

          What do you think?

          Thomas Jungblut added a comment -

          What do you want to put into it?

          Edward J. Yoon added a comment -

          Do you think it is possible to use just the BSPPeer in the task runner? If so, let's do it like that for the moment.

          If we can keep the complexity of the syncServer low and there are no big performance differences, let's get rid of ZooKeeper.

          Thomas Jungblut added a comment -

          Sorry, which BSPPeer in which TaskRunner?
          What is your opinion on the abstract version of BSPPeer, which would just add abstract methods for enter/leave-barrier and getAllPeerNames?
          That would be quite enough.

          big differences about performance
          

          How should we measure this?

          Thomas Jungblut added a comment -

          Re-integration patch to the trunk.

          Just some minor refactoring of the BSPPeer and impl names.

          Please apply this to rev 1182452 and check whether everything is fine again. Then we can go back to patches and sort out the sync/checkpoint issues via the mailing list.

          Thomas Jungblut added a comment -

          Tiny fixes and renames. This works.

          Thomas Jungblut added a comment -

          New patch to catch up to trunk.

          To apply you should first do:

          svn move core/src/main/java/org/apache/hama/bsp/BSPPeer.java core/src/main/java/org/apache/hama/bsp/BSPPeerImpl.java
          

          and

          svn move core/src/main/java/org/apache/hama/bsp/BSPPeerInterface.java core/src/main/java/org/apache/hama/bsp/BSPPeer.java
          

          Then you can safely apply the patch, it will ask where the core/src/main/java/org/apache/hama/bsp/BSPPeerInterface.java is gone, but just ignore the message.

          Then do a

          mvn clean install package

          and

          mvn eclipse:eclipse

          and it should be just fine.

          Can someone please check this in the near future? I don't want to keep updating this patch..

          Edward J. Yoon added a comment -

          Could you please remove the tab characters in the yarn/pom.xml file?

          Let's put this into trunk.

          Thomas Jungblut added a comment -

          oh crap. That was the eclipse formatter. I'll fix that.
          I'm going to commit this then.

          Thomas Jungblut added a comment -

          Great, branch is deleted and it is committed.

          Next steps:
          -fix TODOs
          -integrate the SuperSteps in verbose mode of the YARNBSPJob (not working properly somehow)
          -test Hama-YARN in fully distributed mode (check if configurations and jars are getting copied correctly)
          -add YARN serialize printing to the examples package and make it dependent on the YARN module.

          Minor things:
          -findbugs run, add target and .*files to SVN ignore.

          Things to discuss:
          BSPPeer problems
          -Zookeeper yes/no/yes
          -Checkpointing as a separate process?
          general
          -module layout -> client/api/common/core yes/no/who wants to refactor this?

          I'm going to create subtasks for the next steps and minor things. The discussion items must be transferred to the mailing list.

          Edward J. Yoon added a comment -

          After an svn update, the BSPPeer.getBSPPeerConnection() method always returns null on my 16-node Hama cluster.

          Will you fix this problem?

          Thomas Jungblut added a comment -

          As I already said in Talk, I can't think of a problem here.
          Let's look at the diff:

          Index: core/src/main/java/org/apache/hama/bsp/BSPPeerImpl.java
          ===================================================================
          --- core/src/main/java/org/apache/hama/bsp/BSPPeerImpl.java	(Revision 1182784)
          +++ core/src/main/java/org/apache/hama/bsp/BSPPeerImpl.java	(Arbeitskopie)
          @@ -59,9 +59,9 @@
           /**
            * This class represents a BSP peer.
            */
          -public class BSPPeer implements Watcher, BSPPeerInterface {
          +public class BSPPeerImpl implements Watcher, BSPPeer {
           
          -  public static final Log LOG = LogFactory.getLog(BSPPeer.class);
          +  public static final Log LOG = LogFactory.getLog(BSPPeerImpl.class);
           
             private final Configuration conf;
             private BSPJob jobConf;
          @@ -73,7 +73,7 @@
             private final String bspRoot;
             private final String quorumServers;
           
          -  private final Map<InetSocketAddress, BSPPeerInterface> peers = new ConcurrentHashMap<InetSocketAddress, BSPPeerInterface>();
          +  private final Map<InetSocketAddress, BSPPeer> peers = new ConcurrentHashMap<InetSocketAddress, BSPPeer>();
             private final Map<InetSocketAddress, ConcurrentLinkedQueue<BSPMessage>> outgoingQueues = new ConcurrentHashMap<InetSocketAddress, ConcurrentLinkedQueue<BSPMessage>>();
             private ConcurrentLinkedQueue<BSPMessage> localQueue = new ConcurrentLinkedQueue<BSPMessage>();
             private ConcurrentLinkedQueue<BSPMessage> localQueueForNextIteration = new ConcurrentLinkedQueue<BSPMessage>();
          @@ -192,7 +192,7 @@
             /**
              * Protected default constructor for LocalBSPRunner.
              */
          -  protected BSPPeer() {
          +  protected BSPPeerImpl() {
               bspRoot = null;
               quorumServers = null;
               messageSerializer = null;
          @@ -208,7 +208,7 @@
              * @param umbilical is the bsp protocol used to contact its parent process.
              * @param taskid is the id that current process holds.
              */
          -  public BSPPeer(Configuration conf, TaskAttemptID taskid,
          +  public BSPPeerImpl(Configuration conf, TaskAttemptID taskid,
                 BSPPeerProtocol umbilical) throws IOException {
               this.conf = conf;
               this.taskid = taskid;
          @@ -312,7 +312,7 @@
                 Entry<InetSocketAddress, ConcurrentLinkedQueue<BSPMessage>> entry = it
                     .next();
           
          -      BSPPeerInterface peer = peers.get(entry.getKey());
          +      BSPPeer peer = peers.get(entry.getKey());
                 if (peer == null) {
                   try {
                     peer = getBSPPeerConnection(entry.getKey());
          @@ -587,19 +587,19 @@
           
             @Override
             public long getProtocolVersion(String arg0, long arg1) throws IOException {
          -    return BSPPeerInterface.versionID;
          +    return BSPPeer.versionID;
             }
           
          -  protected BSPPeerInterface getBSPPeerConnection(InetSocketAddress addr)
          +  protected BSPPeer getBSPPeerConnection(InetSocketAddress addr)
                 throws NullPointerException, IOException {
          -    BSPPeerInterface peer;
          +    BSPPeer peer;
               synchronized (this.peers) {
                 peer = peers.get(addr);
           
                 int retries = 0;
                 while (peer != null) {
          -        peer = (BSPPeerInterface) RPC.getProxy(BSPPeerInterface.class,
          -            BSPPeerInterface.versionID, addr, this.conf);
          +        peer = (BSPPeer) RPC.getProxy(BSPPeer.class,
          +            BSPPeer.versionID, addr, this.conf);
           
                   retries++;
                   if (retries > 10) {
          

          As you can see, this is just a simple renaming action.

          Edward J. Yoon added a comment -

          Tested after removing the while loop, and it works well. But I don't know why.. (- _-

          I'm committing that code at the moment.
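          One possible explanation, which is purely my guess from the diff above and not confirmed in the thread: peers.get(addr) is null for any address that has not been cached yet, so `while (peer != null)` never executes its body and the method returns null immediately. The conventional retry loop keeps trying while the proxy is still missing; a toy model (getProxy() stands in for RPC.getProxy(), no real RPC involved):

```java
import java.util.HashMap;
import java.util.Map;

// Toy model of the peer-connection cache; no real RPC involved.
public class RetryLoopSketch {
  private static final Map<String, Object> peers = new HashMap<>();

  // Stand-in for RPC.getProxy(...): pretend the connection succeeds.
  private static Object getProxy(String addr) {
    return "proxy-to-" + addr;
  }

  // Retry while the connection is STILL MISSING: note '== null'.
  // The quoted diff has 'while (peer != null)', which never runs for
  // an uncached address and so would return null immediately.
  static Object getBSPPeerConnection(String addr) {
    Object peer = peers.get(addr);
    int retries = 0;
    while (peer == null) {
      peer = getProxy(addr);
      retries++;
      if (retries > 10) {
        break; // real code would throw after exhausting the retries
      }
    }
    return peer;
  }

  public static void main(String[] args) {
    System.out.println(getBSPPeerConnection("host:61000"));
  }
}
```

          Under this reading, dropping the loop and calling the proxy factory directly would indeed make the method work again, which would match what was observed on the 16-node cluster.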

          Hudson added a comment -

          Integrated in Hama-Nightly #328 (See https://builds.apache.org/job/Hama-Nightly/328/)
          HAMA-431 integration of the branch for YARN.

          tjungblut :
          Files :

          • /incubator/hama/trunk/CHANGES.txt
          • /incubator/hama/trunk/core/src/main/java/org/apache/hama/bsp/BSPPeer.java
          • /incubator/hama/trunk/core/src/main/java/org/apache/hama/bsp/BSPPeerImpl.java
          • /incubator/hama/trunk/core/src/main/java/org/apache/hama/bsp/BSPPeerInterface.java
          • /incubator/hama/trunk/core/src/main/java/org/apache/hama/bsp/BSPTask.java
          • /incubator/hama/trunk/core/src/main/java/org/apache/hama/bsp/GroomServer.java
          • /incubator/hama/trunk/core/src/main/java/org/apache/hama/bsp/LocalBSPRunner.java
          • /incubator/hama/trunk/core/src/main/java/org/apache/hama/bsp/Task.java
          • /incubator/hama/trunk/core/src/main/java/org/apache/hama/checkpoint/Checkpointer.java
          • /incubator/hama/trunk/core/src/test/java/org/apache/hama/bsp/BSPSerializerWrapper.java
          • /incubator/hama/trunk/core/src/test/java/org/apache/hama/checkpoint/TestCheckpoint.java
          • /incubator/hama/trunk/pom.xml
          • /incubator/hama/trunk/yarn
          • /incubator/hama/trunk/yarn/pom.xml
          • /incubator/hama/trunk/yarn/src
          • /incubator/hama/trunk/yarn/src/main
          • /incubator/hama/trunk/yarn/src/main/java
          • /incubator/hama/trunk/yarn/src/main/java/org
          • /incubator/hama/trunk/yarn/src/main/java/org/apache
          • /incubator/hama/trunk/yarn/src/main/java/org/apache/hama
          • /incubator/hama/trunk/yarn/src/main/java/org/apache/hama/bsp
          • /incubator/hama/trunk/yarn/src/main/java/org/apache/hama/bsp/BSPApplicationMaster.java
          • /incubator/hama/trunk/yarn/src/main/java/org/apache/hama/bsp/BSPClient.java
          • /incubator/hama/trunk/yarn/src/main/java/org/apache/hama/bsp/BSPRunner.java
          • /incubator/hama/trunk/yarn/src/main/java/org/apache/hama/bsp/BSPTaskLauncher.java
          • /incubator/hama/trunk/yarn/src/main/java/org/apache/hama/bsp/Job.java
          • /incubator/hama/trunk/yarn/src/main/java/org/apache/hama/bsp/JobImpl.java
          • /incubator/hama/trunk/yarn/src/main/java/org/apache/hama/bsp/YARNBSPJob.java
          • /incubator/hama/trunk/yarn/src/main/java/org/apache/hama/bsp/YARNBSPPeerImpl.java
          • /incubator/hama/trunk/yarn/src/main/java/org/apache/hama/bsp/YarnSerializePrinting.java
          • /incubator/hama/trunk/yarn/src/main/java/org/apache/hama/bsp/sync
          • /incubator/hama/trunk/yarn/src/main/java/org/apache/hama/bsp/sync/StringArrayWritable.java
          • /incubator/hama/trunk/yarn/src/main/java/org/apache/hama/bsp/sync/SyncServer.java
          • /incubator/hama/trunk/yarn/src/main/java/org/apache/hama/bsp/sync/SyncServerImpl.java
          • /incubator/hama/trunk/yarn/src/main/resources
          • /incubator/hama/trunk/yarn/src/main/resources/log4j.properties
          Thomas Jungblut added a comment -

          Made this an umbrella issue and unassigned it.
          Edward J. Yoon added a comment -

          I'm scheduling this for the 0.6 roadmap.
          Thomas Jungblut added a comment -

          Yes, we need to push this to the newest version and make it consistent with our latest release again.
          Edward J. Yoon added a comment -

          I just installed YARN on my test machines. Let's do this in 0.7.
          Thomas Jungblut added a comment -

          We will continue this on a new YARN JIRA.

            People

            • Assignee:
              Edward J. Yoon
            • Reporter:
              Thomas Jungblut
            • Votes:
              1
            • Watchers:
              12