Giraph (Retired) / GIRAPH-154

Worker ports are not synched properly with its peers

Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • 1.0.0
    • None
    • bsp
    • None

    Description

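      When a worker tries multiple ports while setting up its RPC server, the port it finally binds to is not synced properly with its peer workers, so the peers keep sending messages to the default port. In the logs below, worker 161 starts its server on port 36061 but registers port 35061 in its health node, and every peer connection to 35061 is refused.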
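
The failure mode can be sketched in a few lines of Java. This is a minimal illustration, not Giraph's actual code: the class `PortSyncSketch`, the method `bindWithRetry`, and the `tryBind` predicate (standing in for starting the real RPC server) are all hypothetical. The step of 1000 between attempts and the "base port plus partition" rule for the default port are assumptions inferred from the numbers in the logs (34900 + 161 = 35061, and 36061 on bind attempt 1).

```java
import java.util.Collections;
import java.util.Set;
import java.util.function.IntPredicate;

/** Hypothetical sketch of the bind-retry pattern described in this issue. */
public class PortSyncSketch {

    /**
     * Tries defaultPort, defaultPort + step, ... and returns the first port
     * on which tryBind succeeds, or -1 if every attempt fails.
     */
    public static int bindWithRetry(int defaultPort, int step, int maxAttempts,
                                    IntPredicate tryBind) {
        for (int attempt = 0; attempt < maxAttempts; attempt++) {
            int candidate = defaultPort + attempt * step;
            if (tryBind.test(candidate)) {
                return candidate;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        int basePort = 34900;                    // "Base port" from the logs
        int partition = 161;                     // MRpartition of the worker
        int defaultPort = basePort + partition;  // assumed rule: 35061

        // Pretend the default port is already taken on this host.
        Set<Integer> busy = Collections.singleton(35061);
        int boundPort = bindWithRetry(defaultPort, 1000, 20, p -> !busy.contains(p));

        // Bug: the worker advertises the default port, not the bound one.
        int advertisedBuggy = defaultPort;
        // Fix: advertise whatever port the server actually bound to.
        int advertisedFixed = boundPort;

        System.out.println("bound=" + boundPort
                + " advertised(buggy)=" + advertisedBuggy
                + " advertised(fixed)=" + advertisedFixed);
    }
}
```

The point of the sketch is only that the value published to peers should be the return value of the bind loop, not the port computed before the loop started.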

      Here are some logs:

      ############################################################################
      Base port: 34900
      ############################################################################

      ############################################################################
      log for worker 161:
      ############################################################################
      IPC Server handler 98 on 36061: starting
      BasicRPCCommunications: Started RPC communication server: gsta32085.tan.ygrid.yahoo.com/10.216.148.47:36061 with 100 handlers and 199 flush threads on bind attempt 1
      IPC Server handler 99 on 36061: starting
      setup: Registering health of this worker...
      getJobState: Job state already exists (/_hadoopBsp/job_201203130609_14838/_masterJobState)
      getApplicationAttempt: Node /_hadoopBsp/job_201203130609_14838/_applicationAttemptsDir already exists!
      getApplicationAttempt: Node /_hadoopBsp/job_201203130609_14838/_applicationAttemptsDir already exists!
      registerHealth: Created my health node for attempt=0, superstep=-1 with /_hadoopBsp/job_201203130609_14838/_applicationAttemptsDir/0/_superstepDir/-1/_workerHealthyDir/gsta32085.tan.ygrid.yahoo.com_161 and workerInfo= Worker(hostname=gsta32085.tan.ygrid.yahoo.com, MRpartition=161, port=35061)
      process: partitionAssignmentsReadyChanged (partitions are assigned)
      startSuperstep: Ready for computation on superstep -1 since worker selection and vertex range assignments are done in /_hadoopBsp/job_201203130609_14838/_applicationAttemptsDir/0/_superstepDir/-1/_partitionAssignments
      Retrying connect to server: gsta32085.tan.ygrid.yahoo.com/10.216.148.47:35061. Already tried 0 time(s).
      Retrying connect to server: gsta32085.tan.ygrid.yahoo.com/10.216.148.47:35061. Already tried 1 time(s).
      Retrying connect to server: gsta32085.tan.ygrid.yahoo.com/10.216.148.47:35061. Already tried 2 time(s).
      Retrying connect to server: gsta32085.tan.ygrid.yahoo.com/10.216.148.47:35061. Already tried 3 time(s).
      Retrying connect to server: gsta32085.tan.ygrid.yahoo.com/10.216.148.47:35061. Already tried 4 time(s).
      Retrying connect to server: gsta32085.tan.ygrid.yahoo.com/10.216.148.47:35061. Already tried 5 time(s).
      Retrying connect to server: gsta32085.tan.ygrid.yahoo.com/10.216.148.47:35061. Already tried 6 time(s).
      Retrying connect to server: gsta32085.tan.ygrid.yahoo.com/10.216.148.47:35061. Already tried 7 time(s).
      Retrying connect to server: gsta32085.tan.ygrid.yahoo.com/10.216.148.47:35061. Already tried 8 time(s).
      Retrying connect to server: gsta32085.tan.ygrid.yahoo.com/10.216.148.47:35061. Already tried 9 time(s).
      Retrying connect to server: gsta32085.tan.ygrid.yahoo.com/10.216.148.47:35061. Already tried 10 time(s).
      Retrying connect to server: gsta32085.tan.ygrid.yahoo.com/10.216.148.47:35061. Already tried 11 time(s).
      Retrying connect to server: gsta32085.tan.ygrid.yahoo.com/10.216.148.47:35061. Already tried 12 time(s).
      Retrying connect to server: gsta32085.tan.ygrid.yahoo.com/10.216.148.47:35061. Already tried 13 time(s).
      Retrying connect to server: gsta32085.tan.ygrid.yahoo.com/10.216.148.47:35061. Already tried 14 time(s).
      Retrying connect to server: gsta32085.tan.ygrid.yahoo.com/10.216.148.47:35061. Already tried 15 time(s).
      Retrying connect to server: gsta32085.tan.ygrid.yahoo.com/10.216.148.47:35061. Already tried 16 time(s).
      Retrying connect to server: gsta32085.tan.ygrid.yahoo.com/10.216.148.47:35061. Already tried 17 time(s).
      Retrying connect to server: gsta32085.tan.ygrid.yahoo.com/10.216.148.47:35061. Already tried 18 time(s).
      Retrying connect to server: gsta32085.tan.ygrid.yahoo.com/10.216.148.47:35061. Already tried 19 time(s).
      Retrying connect to server: gsta32085.tan.ygrid.yahoo.com/10.216.148.47:35061. Already tried 20 time(s).
      Retrying connect to server: gsta32085.tan.ygrid.yahoo.com/10.216.148.47:35061. Already tried 21 time(s).
      Retrying connect to server: gsta32085.tan.ygrid.yahoo.com/10.216.148.47:35061. Already tried 22 time(s).
      Retrying connect to server: gsta32085.tan.ygrid.yahoo.com/10.216.148.47:35061. Already tried 23 time(s).
      Retrying connect to server: gsta32085.tan.ygrid.yahoo.com/10.216.148.47:35061. Already tried 24 time(s).
      Retrying connect to server: gsta32085.tan.ygrid.yahoo.com/10.216.148.47:35061. Already tried 25 time(s).
      Retrying connect to server: gsta32085.tan.ygrid.yahoo.com/10.216.148.47:35061. Already tried 26 time(s).
      Retrying connect to server: gsta32085.tan.ygrid.yahoo.com/10.216.148.47:35061. Already tried 27 time(s).
      Retrying connect to server: gsta32085.tan.ygrid.yahoo.com/10.216.148.47:35061. Already tried 28 time(s).
      Retrying connect to server: gsta32085.tan.ygrid.yahoo.com/10.216.148.47:35061. Already tried 29 time(s).
      Retrying connect to server: gsta32085.tan.ygrid.yahoo.com/10.216.148.47:35061. Already tried 30 time(s).
      Retrying connect to server: gsta32085.tan.ygrid.yahoo.com/10.216.148.47:35061. Already tried 31 time(s).
      Retrying connect to server: gsta32085.tan.ygrid.yahoo.com/10.216.148.47:35061. Already tried 32 time(s).
      Retrying connect to server: gsta32085.tan.ygrid.yahoo.com/10.216.148.47:35061. Already tried 33 time(s).
      Retrying connect to server: gsta32085.tan.ygrid.yahoo.com/10.216.148.47:35061. Already tried 34 time(s).
      Retrying connect to server: gsta32085.tan.ygrid.yahoo.com/10.216.148.47:35061. Already tried 35 time(s).
      Retrying connect to server: gsta32085.tan.ygrid.yahoo.com/10.216.148.47:35061. Already tried 36 time(s).
      Retrying connect to server: gsta32085.tan.ygrid.yahoo.com/10.216.148.47:35061. Already tried 37 time(s).
      Retrying connect to server: gsta32085.tan.ygrid.yahoo.com/10.216.148.47:35061. Already tried 38 time(s).
      Retrying connect to server: gsta32085.tan.ygrid.yahoo.com/10.216.148.47:35061. Already tried 39 time(s).
      Retrying connect to server: gsta32085.tan.ygrid.yahoo.com/10.216.148.47:35061. Already tried 40 time(s).
      Retrying connect to server: gsta32085.tan.ygrid.yahoo.com/10.216.148.47:35061. Already tried 41 time(s).
      Retrying connect to server: gsta32085.tan.ygrid.yahoo.com/10.216.148.47:35061. Already tried 42 time(s).
      Retrying connect to server: gsta32085.tan.ygrid.yahoo.com/10.216.148.47:35061. Already tried 43 time(s).
      Retrying connect to server: gsta32085.tan.ygrid.yahoo.com/10.216.148.47:35061. Already tried 44 time(s).
      Retrying connect to server: gsta32085.tan.ygrid.yahoo.com/10.216.148.47:35061. Already tried 45 time(s).
      Retrying connect to server: gsta32085.tan.ygrid.yahoo.com/10.216.148.47:35061. Already tried 46 time(s).
      Retrying connect to server: gsta32085.tan.ygrid.yahoo.com/10.216.148.47:35061. Already tried 47 time(s).
      Retrying connect to server: gsta32085.tan.ygrid.yahoo.com/10.216.148.47:35061. Already tried 48 time(s).
      Retrying connect to server: gsta32085.tan.ygrid.yahoo.com/10.216.148.47:35061. Already tried 49 time(s).
      PriviledgedActionException as:job_201203130609_14838 (auth:SIMPLE) cause:java.net.ConnectException: Call to gsta32085.tan.ygrid.yahoo.com/10.216.148.47:35061 failed on connection exception: java.net.ConnectException: Connection refused
      connectAllRPCProxys: Failed on attempt 0 of 5 to connect to (id=33,cur=Worker(hostname=gsta32085.tan.ygrid.yahoo.com, MRpartition=161, port=35061),prev=null,ckpt_file=null)
      java.net.ConnectException: Call to gsta32085.tan.ygrid.yahoo.com/10.216.148.47:35061 failed on connection exception: java.net.ConnectException: Connection refused
      at org.apache.hadoop.ipc.Client.wrapException(Client.java:1095)
      at org.apache.hadoop.ipc.Client.call(Client.java:1071)
      at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:225)
      at $Proxy8.getProtocolVersion(Unknown Source)
      at org.apache.hadoop.ipc.RPC.getProxy(RPC.java:396)
      at org.apache.hadoop.ipc.RPC.getProxy(RPC.java:370)
      at org.apache.hadoop.ipc.RPC.getProxy(RPC.java:420)
      at org.apache.giraph.comm.RPCCommunications$1.run(RPCCommunications.java:159)
      at org.apache.giraph.comm.RPCCommunications$1.run(RPCCommunications.java:155)
      at java.security.AccessController.doPrivileged(Native Method)
      at javax.security.auth.Subject.doAs(Subject.java:396)
      at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1082)
      at org.apache.giraph.comm.RPCCommunications.getRPCProxy(RPCCommunications.java:153)
      at org.apache.giraph.comm.RPCCommunications.getRPCProxy(RPCCommunications.java:51)
      at org.apache.giraph.comm.BasicRPCCommunications.startPeerConnectionThread(BasicRPCCommunications.java:599)
      at org.apache.giraph.comm.BasicRPCCommunications.connectAllRPCProxys(BasicRPCCommunications.java:542)
      at org.apache.giraph.comm.BasicRPCCommunications.setup(BasicRPCCommunications.java:513)
      at org.apache.giraph.graph.BspServiceWorker.setup(BspServiceWorker.java:550)
      at org.apache.giraph.graph.GraphMapper.setup(GraphMapper.java:458)
      at org.apache.giraph.graph.GraphMapper.run(GraphMapper.java:630)
      at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:764)
      at org.apache.hadoop.mapred.MapTask.run(MapTask.java:370)
      at org.apache.hadoop.mapred.Child$4.run(Child.java:255)
      at java.security.AccessController.doPrivileged(Native Method)
      at javax.security.auth.Subject.doAs(Subject.java:396)
      at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1082)
      at org.apache.hadoop.mapred.Child.main(Child.java:249)
      Caused by: java.net.ConnectException: Connection refused
      at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
      at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:574)
      at org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:206)
      at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:656)
      at org.apache.hadoop.ipc.Client$Connection.setupConnection(Client.java:434)
      at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:560)
      at org.apache.hadoop.ipc.Client$Connection.access$2000(Client.java:184)
      at org.apache.hadoop.ipc.Client.getConnection(Client.java:1202)
      at org.apache.hadoop.ipc.Client.call(Client.java:1046)
      ... 25 more

      ############################################################################
      log for worker 154
      ############################################################################
      PriviledgedActionException as:job_201203130609_14838 (auth:SIMPLE) cause:java.net.ConnectException: Call to gsta32085.tan.ygrid.yahoo.com/10.216.148.47:35061 failed on connection exception: java.net.ConnectException: Connection refused
      connectAllRPCProxys: Failed on attempt 4 of 5 to connect to (id=33,cur=Worker(hostname=gsta32085.tan.ygrid.yahoo.com, MRpartition=161, port=35061),prev=null,ckpt_file=null)
      java.net.ConnectException: Call to gsta32085.tan.ygrid.yahoo.com/10.216.148.47:35061 failed on connection exception: java.net.ConnectException: Connection refused
      at org.apache.hadoop.ipc.Client.wrapException(Client.java:1095)
      at org.apache.hadoop.ipc.Client.call(Client.java:1071)
      at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:225)
      at $Proxy8.getProtocolVersion(Unknown Source)
      at org.apache.hadoop.ipc.RPC.getProxy(RPC.java:396)
      at org.apache.hadoop.ipc.RPC.getProxy(RPC.java:370)
      at org.apache.hadoop.ipc.RPC.getProxy(RPC.java:420)
      at org.apache.giraph.comm.RPCCommunications$1.run(RPCCommunications.java:159)
      at org.apache.giraph.comm.RPCCommunications$1.run(RPCCommunications.java:155)
      at java.security.AccessController.doPrivileged(Native Method)
      at javax.security.auth.Subject.doAs(Subject.java:396)
      at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1082)
      at org.apache.giraph.comm.RPCCommunications.getRPCProxy(RPCCommunications.java:153)
      at org.apache.giraph.comm.RPCCommunications.getRPCProxy(RPCCommunications.java:51)
      at org.apache.giraph.comm.BasicRPCCommunications.startPeerConnectionThread(BasicRPCCommunications.java:599)
      at org.apache.giraph.comm.BasicRPCCommunications.connectAllRPCProxys(BasicRPCCommunications.java:542)
      at org.apache.giraph.comm.BasicRPCCommunications.setup(BasicRPCCommunications.java:513)
      at org.apache.giraph.graph.BspServiceWorker.setup(BspServiceWorker.java:550)
      at org.apache.giraph.graph.GraphMapper.setup(GraphMapper.java:458)
      at org.apache.giraph.graph.GraphMapper.run(GraphMapper.java:630)
      at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:764)
      at org.apache.hadoop.mapred.MapTask.run(MapTask.java:370)
      at org.apache.hadoop.mapred.Child$4.run(Child.java:255)
      at java.security.AccessController.doPrivileged(Native Method)
      at javax.security.auth.Subject.doAs(Subject.java:396)
      at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1082)
      at org.apache.hadoop.mapred.Child.main(Child.java:249)
      Caused by: java.net.ConnectException: Connection refused
      at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
      at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:574)
      at org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:206)
      at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:656)
      at org.apache.hadoop.ipc.Client$Connection.setupConnection(Client.java:434)
      at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:560)
      at org.apache.hadoop.ipc.Client$Connection.access$2000(Client.java:184)
      at org.apache.hadoop.ipc.Client.getConnection(Client.java:1202)
      at org.apache.hadoop.ipc.Client.call(Client.java:1046)
      ... 25 more
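
The mismatch in the worker 161 log can also be checked mechanically. The following is a rough sketch, assuming the two log line shapes shown above stay stable; the class `PortMismatchCheck` and its regular expressions are hypothetical helpers written for this report, not part of Giraph or Hadoop.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper: compares the bound port with the registered port. */
public class PortMismatchCheck {

    // Matches the "Started RPC communication server: host/ip:PORT with ..." line.
    static final Pattern STARTED =
            Pattern.compile("Started RPC communication server: \\S+:(\\d+) with");

    // Matches the "registerHealth: ... port=PORT)" line.
    static final Pattern REGISTERED =
            Pattern.compile("registerHealth: .*port=(\\d+)\\)");

    /** Returns the port captured by the pattern, or -1 if the line does not match. */
    public static int extractPort(Pattern pattern, String line) {
        Matcher m = pattern.matcher(line);
        return m.find() ? Integer.parseInt(m.group(1)) : -1;
    }

    /** True when both lines parse and the two ports differ. */
    public static boolean portsDisagree(String startedLine, String registeredLine) {
        int bound = extractPort(STARTED, startedLine);
        int registered = extractPort(REGISTERED, registeredLine);
        return bound != -1 && registered != -1 && bound != registered;
    }

    public static void main(String[] args) {
        // Both lines are copied from the worker 161 log in this report.
        String started = "BasicRPCCommunications: Started RPC communication server: "
                + "gsta32085.tan.ygrid.yahoo.com/10.216.148.47:36061 with 100 handlers "
                + "and 199 flush threads on bind attempt 1";
        String registered = "registerHealth: Created my health node for attempt=0, "
                + "superstep=-1 with /_hadoopBsp/job_201203130609_14838/"
                + "_applicationAttemptsDir/0/_superstepDir/-1/_workerHealthyDir/"
                + "gsta32085.tan.ygrid.yahoo.com_161 and workerInfo= "
                + "Worker(hostname=gsta32085.tan.ygrid.yahoo.com, MRpartition=161, port=35061)";

        System.out.println("bound port:      " + extractPort(STARTED, started));
        System.out.println("registered port: " + extractPort(REGISTERED, registered));
        System.out.println("mismatch:        " + portsDisagree(started, registered));
    }
}
```

For the two lines above, the check reports a bound port of 36061 against a registered port of 35061, which is the same 35061 that the peers are refused on.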

      Attachments

        1. GIRAPH-154.patch
          9 kB
          Zhiwei Gu

          People

            Assignee: Zhiwei Gu (guzhiwei)
            Reporter: Zhiwei Gu (guzhiwei)