To get the HAMA graph package running from the attached archive, please follow these steps:
1) Unzip the archive in the root of a freshly checked out HAMA trunk.
2) Get fastutil (http://fastutil.dsi.unimi.it/) and put its jar under $/lib. I needed to put two other dependencies there (dsiutils and webgraph) because I could not find them in a public Maven repository.
3) To run the example inlink counter hama-graph job, invoke:
java org.apache.hama.graph.GraphComputeRunner <path-to-partitioned-graph>/ <number-of-partitions> org.apache.hama.graph.examples.InlinkCountComputer <max-iterations>
All the dependencies need to be on the classpath, including the config directory of the HAMA distribution being used. The partitioned graph needs to be accessible via NFS on all cluster nodes.
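The classpath setup above can be sketched as a small launcher script. HAMA_HOME and the job arguments (graph path, 4 partitions, 30 iterations) are assumptions for illustration; adjust them to your checkout and graph location.

```shell
#!/bin/sh
# Hypothetical launcher; paths and job arguments are assumptions.
HAMA_HOME=${HAMA_HOME:-/opt/hama}

# Put the config directory and every jar under lib/ on the classpath.
CLASSPATH="$HAMA_HOME/conf"
for jar in "$HAMA_HOME"/lib/*.jar; do
  CLASSPATH="$CLASSPATH:$jar"
done

# Print the command first so the classpath can be checked before running it.
echo java -cp "$CLASSPATH" org.apache.hama.graph.GraphComputeRunner \
  /nfs/graphs/partitioned/ 4 \
  org.apache.hama.graph.examples.InlinkCountComputer 30
```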
A few thoughts on the current implementation:
a) Graph input
Currently, HipG's (www.cs.vu.nl/~ekr/hipg/) segmented graph loader is used to load partitioned graphs. A conversion util is provided (org.apache.hama.graph.format.ConvertFromWebGraphASCII.java) that converts WebGraph's (http://webgraph.dsi.unimi.it/) ASCII format into the segmented graph format.
This mechanism should be replaced with a builder-like approach such as the one employed by, e.g., GoldenOrb: based on an input specification, a factory class chosen according to the graph's input format should load the vertexes into memory.
A good approach would be to let workers load data-local segments of the graph and to assign partitions to workers based on data locality; loaded vertexes that do not belong to a worker's partition(s) could then be sent to the proper workers over the network.
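The factory idea could look roughly like the sketch below: a per-format reader behind a common interface, so the runner can load vertexes without knowing the on-disk graph representation. All names are illustrative, not actual HAMA API; the line format ("vertexId: neighbor neighbor ...") is only loosely modeled on WebGraph's ASCII representation.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of the builder-like approach: one factory per input
// format, each producing an in-memory adjacency map of the graph.
interface GraphReaderFactory {
    Map<Long, List<Long>> load(Reader input) throws IOException;
}

// One concrete format: "vertexId: neighbor neighbor ..." per line.
class AdjacencyListReaderFactory implements GraphReaderFactory {
    @Override
    public Map<Long, List<Long>> load(Reader input) throws IOException {
        Map<Long, List<Long>> graph = new HashMap<>();
        BufferedReader reader = new BufferedReader(input);
        for (String line; (line = reader.readLine()) != null; ) {
            String[] parts = line.split(":");
            long vertexId = Long.parseLong(parts[0].trim());
            List<Long> neighbors = new ArrayList<>();
            if (parts.length > 1 && !parts[1].trim().isEmpty()) {
                for (String n : parts[1].trim().split("\\s+")) {
                    neighbors.add(Long.parseLong(n));
                }
            }
            graph.put(vertexId, neighbors);
        }
        return graph;
    }
}
```

A data-locality-aware loader would then be just another implementation of the same interface.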
b) Message buffering
In the current implementation, messages are grouped by vertex after a new superstep starts, i.e., VertexComputerGeneric iterates through all messages received by the BSPPeer and puts them into the queues of the respective vertexes.
This memory copying could be avoided by letting the BSP framework do the grouping while receiving the messages, i.e., by tagging messages so that all messages intended for the same vertex ID share a tag.
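The receive-time grouping could be sketched as follows; the class and method names are hypothetical, not the actual HAMA API. Messages go straight into per-vertex queues as they arrive, so no second pass over the inbox is needed when the superstep starts:

```java
import java.util.ArrayDeque;
import java.util.HashMap;
import java.util.Map;
import java.util.Queue;

// Hypothetical sketch: bucket incoming messages by vertex ID (the "tag")
// at receive time, instead of copying them into vertex queues afterwards.
class TaggedMessageBuffer<M> {
    // vertex ID -> queue of messages addressed to that vertex
    private final Map<String, Queue<M>> byVertex = new HashMap<>();

    // Called by the framework's receive path for every incoming message.
    public void receive(String vertexId, M message) {
        byVertex.computeIfAbsent(vertexId, id -> new ArrayDeque<>()).add(message);
    }

    // Called by the vertex compute loop at the start of the next superstep;
    // hands over (and clears) the queue for one vertex.
    public Queue<M> messagesFor(String vertexId) {
        Queue<M> q = byVertex.remove(vertexId);
        return q != null ? q : new ArrayDeque<>();
    }
}
```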
c) Vertex output
Currently nothing is output from the vertexes. It would, however, be relatively easy to add, e.g., by requiring that the Vertex class implement Writable and serializing all vertexes at the end of certain supersteps. This would also be a step towards fault tolerance.
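A minimal sketch of such a vertex, with the write()/readFields() method pair that Hadoop's Writable interface declares (the class name and fields are made up for the inlink-count example; the Hadoop import is omitted to keep the sketch self-contained):

```java
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

// Hypothetical sketch: a vertex with Writable-style serialization, so the
// framework could dump all vertexes at the end of selected supersteps,
// both as job output and as a checkpoint for fault tolerance.
class InlinkCountVertex {
    long vertexId;
    long inlinkCount;

    public void write(DataOutput out) throws IOException {
        out.writeLong(vertexId);
        out.writeLong(inlinkCount);
    }

    public void readFields(DataInput in) throws IOException {
        vertexId = in.readLong();
        inlinkCount = in.readLong();
    }
}
```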