Hama / HAMA-973

GraphJob and RandBench examples work incorrectly when FT is enabled.


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Critical
    • Resolution: Fixed
    • Affects Version/s: 0.7.0
    • Fix Version/s: 0.7.1
    • Component/s: bsp core
    • Labels: None

    Description

      Today I tested the fault tolerance (FT) feature with RandBench. FT itself works fine, but I found a bug in the RandBench program.

      [root@cluster-0 hama-0.7.0]# bin/hama jar hama-examples-0.7.0.jar bench 100 100 100
      15/09/03 12:59:57 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
      15/09/03 12:59:58 INFO Configuration.deprecation: user.name is deprecated. Instead, use mapreduce.job.user.name
      15/09/03 12:59:58 INFO bsp.BSPJobClient: Running job: job_201509031258_0002
      15/09/03 13:00:01 INFO bsp.BSPJobClient: Current supersteps number: 0
      15/09/03 13:00:22 INFO bsp.BSPJobClient: Current supersteps number: 2
      15/09/03 13:00:26 INFO bsp.BSPJobClient: Current supersteps number: 5
      15/09/03 13:00:29 INFO bsp.BSPJobClient: Current supersteps number: 11
      15/09/03 13:00:32 INFO bsp.BSPJobClient: Current supersteps number: 16
      15/09/03 13:00:35 INFO bsp.BSPJobClient: Current supersteps number: 21
      15/09/03 13:00:38 INFO bsp.BSPJobClient: Current supersteps number: 28
      15/09/03 13:00:41 INFO bsp.BSPJobClient: Current supersteps number: 35
      15/09/03 13:00:44 INFO bsp.BSPJobClient: Current supersteps number: 42
      15/09/03 13:00:47 INFO bsp.BSPJobClient: Current supersteps number: 49
      15/09/03 13:00:50 INFO bsp.BSPJobClient: Current supersteps number: 56
      15/09/03 13:02:05 INFO bsp.BSPJobClient: Current supersteps number: 0
      15/09/03 13:02:08 INFO bsp.BSPJobClient: Current supersteps number: 56
      15/09/03 13:02:11 INFO bsp.BSPJobClient: Current supersteps number: 0
      15/09/03 13:02:20 INFO bsp.BSPJobClient: Current supersteps number: 57
      15/09/03 13:02:23 INFO bsp.BSPJobClient: Current supersteps number: 61
      15/09/03 13:02:26 INFO bsp.BSPJobClient: Current supersteps number: 67
      15/09/03 13:02:29 INFO bsp.BSPJobClient: Current supersteps number: 72
      15/09/03 13:02:32 INFO bsp.BSPJobClient: Current supersteps number: 77
      15/09/03 13:02:35 INFO bsp.BSPJobClient: Current supersteps number: 84
      15/09/03 13:02:38 INFO bsp.BSPJobClient: Current supersteps number: 91
      15/09/03 13:02:41 INFO bsp.BSPJobClient: Current supersteps number: 97
      15/09/03 13:02:44 INFO bsp.BSPJobClient: Current supersteps number: 106
      15/09/03 13:02:47 INFO bsp.BSPJobClient: Current supersteps number: 113
      15/09/03 13:02:50 INFO bsp.BSPJobClient: Current supersteps number: 125
      15/09/03 13:02:53 INFO bsp.BSPJobClient: Current supersteps number: 134
      15/09/03 13:02:56 INFO bsp.BSPJobClient: Current supersteps number: 144
      15/09/03 13:02:59 INFO bsp.BSPJobClient: Current supersteps number: 152
      15/09/03 13:03:02 INFO bsp.BSPJobClient: Current supersteps number: 156
      15/09/03 13:03:05 INFO bsp.BSPJobClient: The total number of supersteps: 156
      15/09/03 13:03:05 INFO bsp.BSPJobClient: Counters: 6
      15/09/03 13:03:05 INFO bsp.BSPJobClient:   org.apache.hama.bsp.JobInProgress$JobCounter
      15/09/03 13:03:05 INFO bsp.BSPJobClient:     SUPERSTEPS=156
      15/09/03 13:03:05 INFO bsp.BSPJobClient:     LAUNCHED_TASKS=160
      15/09/03 13:03:05 INFO bsp.BSPJobClient:   org.apache.hama.bsp.BSPPeerImpl$PeerCounter
      15/09/03 13:03:05 INFO bsp.BSPJobClient:     SUPERSTEP_SUM=24960
      15/09/03 13:03:05 INFO bsp.BSPJobClient:     TIME_IN_SYNC_MS=1943366
      15/09/03 13:03:05 INFO bsp.BSPJobClient:     TOTAL_MESSAGES_SENT=1600000
      15/09/03 13:03:05 INFO bsp.BSPJobClient:     TOTAL_MESSAGES_RECEIVED=1600000
      Job Finished in 187.453 seconds
      

      I ran the job with the maximum number of iterations set to 100. At superstep 56, I killed one task manually and confirmed that the failed task was recovered automatically. However, the total number of supersteps was 156, not 100.

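      The reason is simple: the loop counter i always starts from 0, so after recovery the loop runs another nSupersteps iterations on top of the 56 supersteps already completed (56 + 100 = 156). To fix this issue, i has to be initialized to (int) peer.getSuperstepCount().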

          public void bsp(
              BSPPeer<NullWritable, NullWritable, NullWritable, NullWritable, BytesWritable> peer)
              throws IOException, SyncException, InterruptedException {
            byte[] dummyData = new byte[sizeOfMsg];
            String[] peers = peer.getAllPeerNames();
      
            for (int i = 0; i < nSupersteps; i++) {
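
      The effect of the loop's starting value can be reproduced without a cluster. The following is a minimal standalone sketch, not the attached patch: it uses no Hama classes, and startSuperstep is a hypothetical stand-in for the value that (int) peer.getSuperstepCount() would return when bsp() is entered again after recovery (assumed to be 56 here, matching the log above).

```java
// Standalone sketch: no Hama classes are used. startSuperstep stands in for
// (int) peer.getSuperstepCount() at the moment bsp() is (re)entered.
final class ResumeLoopSketch {

  // Counts how many supersteps the loop executes when i starts at startSuperstep.
  static int iterationsExecuted(int startSuperstep, int nSupersteps) {
    int executed = 0;
    for (int i = startSuperstep; i < nSupersteps; i++) {
      executed++; // the real loop would send messages and call peer.sync() here
    }
    return executed;
  }

  public static void main(String[] args) {
    int nSupersteps = 100;
    int completedBeforeFailure = 56; // supersteps finished before the task was killed

    // Reported behavior: the relaunched task restarts the counter at 0.
    int totalFromZero = completedBeforeFailure + iterationsExecuted(0, nSupersteps);

    // Proposed behavior: the counter resumes from the recovered superstep count.
    int totalResumed = completedBeforeFailure
        + iterationsExecuted(completedBeforeFailure, nSupersteps);

    System.out.println("total supersteps, i starts at 0: " + totalFromZero);
    System.out.println("total supersteps, i resumes:     " + totalResumed);
  }
}
```

      With these inputs the first total is 156, the value reported in the job counters, and the second is 100, the requested number of supersteps.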
      

      GraphJobRunner also has a similar problem. When a task is relaunched, the setup() method is called again. The code below should run only during the initial phase, not on recovery.

          long startTime = System.currentTimeMillis();
          loadVertices(peer);
          LOG.info("Total time spent for loading vertices: "
              + (System.currentTimeMillis() - startTime) + " ms");
      
          startTime = System.currentTimeMillis();
          countGlobalVertexCount(peer);
          LOG.info("Total time spent for broadcasting global vertex count: "
              + (System.currentTimeMillis() - startTime) + " ms");
      
          startTime = System.currentTimeMillis();
          doInitialSuperstep(peer);
          LOG.info("Total time spent for initial superstep: "
              + (System.currentTimeMillis() - startTime) + " ms");
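
      One way to restrict these calls to the initial phase is to guard them with a check on the current superstep count. The sketch below only illustrates that idea and is not the attached patch: it uses no Hama classes, the superstepCount parameter is a hypothetical stand-in for peer.getSuperstepCount(), and the assumption that a fresh task sees 0 while a relaunched task sees a positive value is mine, not confirmed by the report.

```java
import java.util.ArrayList;
import java.util.List;

// Standalone sketch: records which initialization steps setup() would run.
// superstepCount is a hypothetical stand-in for peer.getSuperstepCount().
final class SetupGuardSketch {

  static List<String> setup(long superstepCount) {
    List<String> executedSteps = new ArrayList<>();

    // Assumption: a count of 0 means a fresh start; anything else means the
    // task was relaunched after a failure and must not repeat these steps.
    if (superstepCount == 0) {
      executedSteps.add("loadVertices");
      executedSteps.add("countGlobalVertexCount");
      executedSteps.add("doInitialSuperstep");
    }
    return executedSteps;
  }

  public static void main(String[] args) {
    System.out.println("fresh start: " + setup(0));
    System.out.println("relaunched at superstep 56: " + setup(56));
  }
}
```

      A relaunched task may still need other recovery work, such as restoring state from a checkpoint. That part is outside the scope of this sketch.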
      

      Attachments

        1. patch.txt
          9 kB
          Edward J. Yoon


          People

            Assignee: Edward J. Yoon (udanax)
            Reporter: Edward J. Yoon (udanax)