  Giraph (Retired) / GIRAPH-46

Race condition on superstep 1 with RPC servers not started by the time that requests are sent

Details

    • Type: Bug
    • Status: Resolved
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: 0.1.0
    • Fix Version/s: 0.1.0
    • Component/s: None
    • Labels: None

    Description

      Hi,

      Occasionally (maybe one run in four), my Giraph job fails with the RuntimeException below.
      According to the code, this should never happen:

      if (msgMap == null) { // should never happen after constructor
          throw new RuntimeException(
              "sendMessage: msgMap did not exist for " + addr +
              " for vertex " + destVertex);
      }
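To make the failure mode concrete, here is a minimal sketch of how such a check can fire. It assumes the per-worker message buffers are created once, at construction time, for the workers known at that moment. The class `MsgMapSketch` and its fields are hypothetical stand-ins, not the actual `BasicRPCCommunications` code.

```java
import java.net.InetSocketAddress;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical stand-in for per-worker outgoing message buffers. */
public class MsgMapSketch {
    // One buffer per worker address: destination vertex id -> queued messages.
    private final Map<InetSocketAddress, Map<Long, List<String>>> outMessages =
        new HashMap<>();

    /** Buffers are created only for the workers known at construction time. */
    public MsgMapSketch(List<InetSocketAddress> knownWorkers) {
        for (InetSocketAddress addr : knownWorkers) {
            outMessages.put(addr, new HashMap<>());
        }
    }

    /** Queues a message, failing if the target worker was never registered. */
    public void sendMessage(InetSocketAddress addr, long destVertex, String msg) {
        Map<Long, List<String>> msgMap = outMessages.get(addr);
        if (msgMap == null) {
            throw new RuntimeException(
                "sendMessage: msgMap did not exist for " + addr +
                " for vertex " + destVertex);
        }
        msgMap.computeIfAbsent(destVertex, k -> new ArrayList<>()).add(msg);
    }

    /** Number of messages queued for a vertex on a given worker. */
    public int pendingFor(InetSocketAddress addr, long destVertex) {
        Map<Long, List<String>> msgMap = outMessages.get(addr);
        if (msgMap == null || !msgMap.containsKey(destVertex)) {
            return 0;
        }
        return msgMap.get(destVertex).size();
    }
}
```

Under this assumption, any worker address that appears after construction (or was missed during it) triggers the exception on the first send.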

      This happens during superstep 1 (the second superstep). My application adds edges in superstep 1
      (so that every out-edge also becomes an in-edge of its destination), but since I am running on only 3 workers,
      I would be surprised if some worker had not been registered in the RPC layer initially.
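For readers unfamiliar with the edge-adding step, the following sketch shows the end result it aims for on a plain in-memory adjacency map. It deliberately does not use the Giraph vertex API. `SymmetrizeSketch` and `symmetrize` are illustrative names only, and the real application presumably achieves this through messages between supersteps.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Illustrative only: makes every out-edge also an edge from its destination. */
public class SymmetrizeSketch {
    /** Returns a new adjacency map in which each edge (u, v) implies (v, u). */
    public static Map<Long, Set<Long>> symmetrize(Map<Long, Set<Long>> outEdges) {
        Map<Long, Set<Long>> result = new HashMap<>();
        for (Map.Entry<Long, Set<Long>> entry : outEdges.entrySet()) {
            long src = entry.getKey();
            // Keep the original out-edges of the source vertex.
            result.computeIfAbsent(src, k -> new TreeSet<>()).addAll(entry.getValue());
            for (long dst : entry.getValue()) {
                // Add the reverse edge, creating the destination vertex if needed.
                result.computeIfAbsent(dst, k -> new TreeSet<>()).add(src);
            }
        }
        return result;
    }
}
```

Note that the reverse edge may target a vertex owned by a different worker, which is why this step involves cross-worker communication in a distributed run.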

      One hypothesis is that Hadoop does something unexpected, because one of my servers was under heavy
      load. Maybe Hadoop launched another worker to replace a slow one? Can that happen?

      java.lang.RuntimeException: sendMessage: msgMap did not exist for [hostname].ml.cmu.edu:30003 for vertex 875713
      at org.apache.giraph.comm.BasicRPCCommunications.sendMessageReq(BasicRPCCommunications.java:825)
      at org.apache.giraph.graph.BasicVertex.sendMsg(BasicVertex.java:179)
      at edu.cmu.selectlab.BP.BinaryBPVertex.compute(BinaryBPVertex.java:94)
      at org.apache.giraph.graph.GraphMapper.map(GraphMapper.java:624)
      at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
      at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:763)
      at org.apache.hadoop.mapred.MapTask.run(MapTask.java:369)
      at org.apache.hadoop.mapred.Child$4.run(Child.java:259)
      at java.security.AccessController.doPrivileged(Native Method)
      at javax.security.auth.Subject.doAs(Subject.java:396)
      at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1059)
      at org.apache.hadoop.mapred.Child.main(Child.java:253)
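The issue title attributes the failure to requests being sent before all RPC servers have started. One generic way to guard against that kind of race is a readiness gate that senders wait on. The sketch below is an assumption-laden illustration using `java.util.concurrent.CountDownLatch`. It is not the change in the attached diff.txt, and `ReadinessGateSketch` is a hypothetical name.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

/** Hypothetical gate: senders block until every expected server has started. */
public class ReadinessGateSketch {
    private final CountDownLatch serversReady;

    public ReadinessGateSketch(int expectedServers) {
        this.serversReady = new CountDownLatch(expectedServers);
    }

    /** Called once by each server after it is able to accept requests. */
    public void markServerStarted() {
        serversReady.countDown();
    }

    /**
     * Waits up to the given timeout for all servers to start.
     * Returns true if all started, false on timeout or interruption.
     */
    public boolean awaitServers(long timeout, TimeUnit unit) {
        try {
            return serversReady.await(timeout, unit);
        } catch (InterruptedException e) {
            // Preserve the interrupt status for the caller.
            Thread.currentThread().interrupt();
            return false;
        }
    }
}
```

A sender would call `awaitServers` before its first request and treat a `false` result as a startup failure rather than proceeding to send.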

      Attachments

        1. diff.txt
          0.7 kB
          Avery Ching

          People

            aching Avery Ching
            aching Avery Ching