Description
Today I tested the fault tolerance function with RandBench. FT works fine, but I found a bug in the RandBench program.
[root@cluster-0 hama-0.7.0]# bin/hama jar hama-examples-0.7.0.jar bench 100 100 100
15/09/03 12:59:57 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
15/09/03 12:59:58 INFO Configuration.deprecation: user.name is deprecated. Instead, use mapreduce.job.user.name
15/09/03 12:59:58 INFO bsp.BSPJobClient: Running job: job_201509031258_0002
15/09/03 13:00:01 INFO bsp.BSPJobClient: Current supersteps number: 0
15/09/03 13:00:22 INFO bsp.BSPJobClient: Current supersteps number: 2
15/09/03 13:00:26 INFO bsp.BSPJobClient: Current supersteps number: 5
15/09/03 13:00:29 INFO bsp.BSPJobClient: Current supersteps number: 11
15/09/03 13:00:32 INFO bsp.BSPJobClient: Current supersteps number: 16
15/09/03 13:00:35 INFO bsp.BSPJobClient: Current supersteps number: 21
15/09/03 13:00:38 INFO bsp.BSPJobClient: Current supersteps number: 28
15/09/03 13:00:41 INFO bsp.BSPJobClient: Current supersteps number: 35
15/09/03 13:00:44 INFO bsp.BSPJobClient: Current supersteps number: 42
15/09/03 13:00:47 INFO bsp.BSPJobClient: Current supersteps number: 49
15/09/03 13:00:50 INFO bsp.BSPJobClient: Current supersteps number: 56
15/09/03 13:02:05 INFO bsp.BSPJobClient: Current supersteps number: 0
15/09/03 13:02:08 INFO bsp.BSPJobClient: Current supersteps number: 56
15/09/03 13:02:11 INFO bsp.BSPJobClient: Current supersteps number: 0
15/09/03 13:02:20 INFO bsp.BSPJobClient: Current supersteps number: 57
15/09/03 13:02:23 INFO bsp.BSPJobClient: Current supersteps number: 61
15/09/03 13:02:26 INFO bsp.BSPJobClient: Current supersteps number: 67
15/09/03 13:02:29 INFO bsp.BSPJobClient: Current supersteps number: 72
15/09/03 13:02:32 INFO bsp.BSPJobClient: Current supersteps number: 77
15/09/03 13:02:35 INFO bsp.BSPJobClient: Current supersteps number: 84
15/09/03 13:02:38 INFO bsp.BSPJobClient: Current supersteps number: 91
15/09/03 13:02:41 INFO bsp.BSPJobClient: Current supersteps number: 97
15/09/03 13:02:44 INFO bsp.BSPJobClient: Current supersteps number: 106
15/09/03 13:02:47 INFO bsp.BSPJobClient: Current supersteps number: 113
15/09/03 13:02:50 INFO bsp.BSPJobClient: Current supersteps number: 125
15/09/03 13:02:53 INFO bsp.BSPJobClient: Current supersteps number: 134
15/09/03 13:02:56 INFO bsp.BSPJobClient: Current supersteps number: 144
15/09/03 13:02:59 INFO bsp.BSPJobClient: Current supersteps number: 152
15/09/03 13:03:02 INFO bsp.BSPJobClient: Current supersteps number: 156
15/09/03 13:03:05 INFO bsp.BSPJobClient: The total number of supersteps: 156
15/09/03 13:03:05 INFO bsp.BSPJobClient: Counters: 6
15/09/03 13:03:05 INFO bsp.BSPJobClient: org.apache.hama.bsp.JobInProgress$JobCounter
15/09/03 13:03:05 INFO bsp.BSPJobClient: SUPERSTEPS=156
15/09/03 13:03:05 INFO bsp.BSPJobClient: LAUNCHED_TASKS=160
15/09/03 13:03:05 INFO bsp.BSPJobClient: org.apache.hama.bsp.BSPPeerImpl$PeerCounter
15/09/03 13:03:05 INFO bsp.BSPJobClient: SUPERSTEP_SUM=24960
15/09/03 13:03:05 INFO bsp.BSPJobClient: TIME_IN_SYNC_MS=1943366
15/09/03 13:03:05 INFO bsp.BSPJobClient: TOTAL_MESSAGES_SENT=1600000
15/09/03 13:03:05 INFO bsp.BSPJobClient: TOTAL_MESSAGES_RECEIVED=1600000
Job Finished in 187.453 seconds
I ran it with the maximum number of iterations set to 100. At superstep 56, I killed one task manually and confirmed that the failed task was recovered automatically. However, the total number of supersteps was 156, not 100.
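The reason is simple: the loop counter i always starts from 0, so after recovery at superstep 56 the loop runs another 100 iterations (56 + 100 = 156). To fix this issue, we have to initialize i to (int) peer.getSuperstepCount().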
public void bsp(
    BSPPeer<NullWritable, NullWritable, NullWritable, NullWritable, BytesWritable> peer)
    throws IOException, SyncException, InterruptedException {
  byte[] dummyData = new byte[sizeOfMsg];
  String[] peers = peer.getAllPeerNames();
  for (int i = 0; i < nSupersteps; i++) {
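The proposed fix can be illustrated with a small standalone sketch. It does not use the Hama classes: SuperstepSource is a hypothetical stand-in for the single BSPPeer method involved, and the sketch assumes that a recovered task reports the superstep it was restored to (56 in the run above).

```java
public class ResumeLoopSketch {
    /** Hypothetical stand-in for the one BSPPeer method this sketch needs. */
    interface SuperstepSource {
        long getSuperstepCount();
    }

    /** Runs the benchmark loop and returns how many iterations this task executed. */
    static int runLoop(SuperstepSource peer, int nSupersteps) {
        int executed = 0;
        // Start from the superstep the peer has already reached instead of 0,
        // so a relaunched task only runs the remaining iterations.
        for (int i = (int) peer.getSuperstepCount(); i < nSupersteps; i++) {
            // The real program would send its messages and call peer.sync() here.
            executed++;
        }
        return executed;
    }

    public static void main(String[] args) {
        System.out.println(runLoop(() -> 0L, 100));  // fresh task: prints 100
        System.out.println(runLoop(() -> 56L, 100)); // task recovered at superstep 56: prints 44
    }
}
```

With this change a task recovered at superstep 56 would execute 44 more iterations, for a total of 100 rather than 156.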
GraphJobRunner also has a similar problem. When a task is relaunched, the setup() method is called again. The code below should be executed only in the initial phase.
long startTime = System.currentTimeMillis();
loadVertices(peer);
LOG.info("Total time spent for loading vertices: "
    + (System.currentTimeMillis() - startTime) + " ms");

startTime = System.currentTimeMillis();
countGlobalVertexCount(peer);
LOG.info("Total time spent for broadcasting global vertex count: "
    + (System.currentTimeMillis() - startTime) + " ms");

startTime = System.currentTimeMillis();
doInitialSuperstep(peer);
LOG.info("Total time spent for initial superstep: "
    + (System.currentTimeMillis() - startTime) + " ms");
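One possible shape for such a guard is sketched below as a standalone class, not as a patch to GraphJobRunner. It assumes that a superstep count of 0 identifies the initial phase, which is an assumption of this sketch and not something confirmed above; the class and the recorded step names are illustrative only. How a relaunched task gets its vertices back is outside the scope of the sketch.

```java
import java.util.ArrayList;
import java.util.List;

public class SetupGuardSketch {
    /** Records which steps ran, in place of the real GraphJobRunner work. */
    final List<String> calls = new ArrayList<>();

    /** Hypothetical setup(): superstepCount stands in for peer.getSuperstepCount(). */
    void setup(long superstepCount) {
        // Assumption: a count of 0 means this is the first launch of the task.
        if (superstepCount == 0) {
            calls.add("loadVertices");
            calls.add("countGlobalVertexCount");
            calls.add("doInitialSuperstep");
        } else {
            // Relaunched task: skip the initial-phase work.
            calls.add("skipInitialPhase");
        }
    }

    public static void main(String[] args) {
        SetupGuardSketch fresh = new SetupGuardSketch();
        fresh.setup(0);
        System.out.println(fresh.calls);

        SetupGuardSketch relaunched = new SetupGuardSketch();
        relaunched.setup(56);
        System.out.println(relaunched.calls);
    }
}
```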