Long comment, the leisure of the weekend
Good to see the ball rolling.
I had a browsing session on the current HAMA code (let's call this HamaV1) and the mapreduce-integration branch (this should really be called Yarn-integration; let's call it HamaV2).
Some thoughts follow. Some of them may be naive, as I am new around here.
Regarding the Job and Task state machines: Yes, it does look like you don't need a lot of states and their corresponding transitions here, from what I can see in HamaV1's JobInProgress and TaskInProgress. Is that because you don't have good failure handling in HamaV1 (as I read in one of the presentations)? If that isn't true, ignore what follows. Otherwise, I think this is the right time to think about fault tolerance (if at all) and to write down the state machines so that they include the failure scenarios.
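To make "write down the state machines" a bit more concrete, here is a rough sketch of a task state machine with the failure and retry transitions spelled out. The state names and the allowed transitions are my own assumptions for illustration; they are not taken from HamaV1's TaskInProgress or from MRV2.

```java
import java.util.EnumSet;
import java.util.Set;

/**
 * Hypothetical task states for a BSP task, including failure paths.
 * Names and transitions are illustrative assumptions only.
 */
enum BspTaskState {
    PENDING, RUNNING, SUCCEEDED, FAILED, KILLED;

    /** States that may legally follow this one. */
    Set<BspTaskState> next() {
        switch (this) {
            case PENDING:
                return EnumSet.of(RUNNING, KILLED);
            case RUNNING:
                return EnumSet.of(SUCCEEDED, FAILED, KILLED);
            case FAILED:
                // A failed task may be rescheduled (retry).
                return EnumSet.of(PENDING);
            default:
                // SUCCEEDED and KILLED are terminal.
                return EnumSet.noneOf(BspTaskState.class);
        }
    }

    boolean canMoveTo(BspTaskState target) {
        return next().contains(target);
    }

    public static void main(String[] args) {
        // Print the transition table so it can be reviewed at a glance.
        for (BspTaskState s : values()) {
            System.out.println(s + " -> " + s.next());
        }
    }
}
```

Having the table in one place like this makes it easy to argue about which failure transitions we actually want before any handler code is written.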
Implementation of barrier synchronization: I'm not sure what problems you ran into with ZooKeeper in HamaV1, but can't we use the ApplicationMaster (AM) in HamaV2 as a barrier synchronization service? Each BSPPeer could periodically poll the AM to ask whether it can proceed to the next superstep. If and when the AM goes down, all the BSPPeers just wait there, spinning, until the AM is restarted by the Yarn ResourceManager.
– Pros: Avoiding ZooKeeper frees BSP from the external ZK dependency; one less service is needed for running HAMA apps.
– Cons: It robs HAMA of the notification push via ZK's watcher mechanism, so peers have to pull periodically instead of being notified (this should be agreeable, no?).
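A minimal sketch of the polling idea above, to show how little state the AM would need. Everything here is hypothetical: the class and method names (PollingBarrier, arrive, canProceed, awaitBarrier) are made up, and plain in-process method calls stand in for whatever RPC the BSPPeers would really use to talk to the AM. It also does not cover AM restart, where the arrival set would be lost and peers would have to report again.

```java
import java.util.HashSet;
import java.util.Set;

/** Hypothetical AM-side barrier: peers report arrival, then poll until all have arrived. */
class PollingBarrier {
    private final int numPeers;
    private long currentSuperstep = 0;
    private final Set<String> arrived = new HashSet<>();

    PollingBarrier(int numPeers) {
        this.numPeers = numPeers;
    }

    /** A peer announces that it has finished the given superstep. */
    synchronized void arrive(String peerId, long superstep) {
        if (superstep != currentSuperstep) {
            return; // report for any other superstep is ignored
        }
        arrived.add(peerId); // a Set makes repeated reports harmless
        if (arrived.size() == numPeers) {
            arrived.clear();
            currentSuperstep++; // everyone is here: open the barrier
        }
    }

    /** A peer polls: may it move past the given superstep? */
    synchronized boolean canProceed(long superstep) {
        return currentSuperstep > superstep;
    }

    synchronized long currentSuperstep() {
        return currentSuperstep;
    }

    /** Peer-side loop: report arrival, then poll every pollMillis until released. */
    static void awaitBarrier(PollingBarrier am, String peerId, long superstep, long pollMillis)
            throws InterruptedException {
        am.arrive(peerId, superstep);
        while (!am.canProceed(superstep)) {
            Thread.sleep(pollMillis);
        }
    }

    public static void main(String[] args) throws InterruptedException {
        PollingBarrier am = new PollingBarrier(2);
        Thread[] peers = new Thread[2];
        for (int i = 0; i < peers.length; i++) {
            final String peerId = "peer-" + i;
            peers[i] = new Thread(() -> {
                try {
                    for (long step = 0; step < 3; step++) {
                        awaitBarrier(am, peerId, step, 10);
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
            peers[i].start();
        }
        for (Thread t : peers) {
            t.join();
        }
        System.out.println("supersteps completed: " + am.currentSuperstep());
    }
}
```

The poll interval is the price of the pull model: a peer can sit idle for up to one interval after the barrier opens, which is exactly what ZK watchers would avoid.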
Regarding use of MR classes:
- Reuse of MRV2 classes: I was appalled by the amount of Hadoop MapReduce code (kinda) forked in HamaV1. Glad that with Yarn and HamaV2, most of the forking will be gone. Still, one look at the HamaV2 code you have at Google Code tells me you are trying to mimic MRV2 (MapReduce over YARN) internals. IMO, that isn't needed, as Job, Task, TaskAttempt etc. in MR carry concepts specific to MapReduce, like Map/Reduce tasks. I think we can redesign the objects HAMA needs here with far more ease. And that's cleaner too.
- Code reuse from MRV2: OTOH, I do clearly see that we should reuse MRV2 components like ContainerLauncher (launches containers on nodes) and RMContainerAllocator (requests containers from the ResourceManager). I'll see how we can move these out of MRV2 into a separate common library module so that Hama (and possibly others) can use them.
Meta comment: Instead of jumping into writing the implementation, I think it helps to spend some time developing the design until it reaches some level of stability, then writing down the module structure (like a BspAppMaster module, a BspChild module etc.), followed by the interfaces of all the data objects and components, and finally wiring them together. Once we have all the interfaces and communication patterns in place, implementation can be done in parallel. It helped us write MRV2 a lot cleaner; I am sure it will help us here too.
General infra thought: I think having this branch in Apache svn helps HAMA's incubation status. It will also make it easy for anyone else from the current hama-dev who is interested in working on this to use the Apache lists, svn etc. (Oh, BTW, I am looking to collaborate too.) What do you think?