Details
Type: Bug
Status: Open
Priority: Major
Resolution: Unresolved
Affects Version/s: 1.0.0
None
None
Hadoop 0.20.203.0 on a 32-node cluster
Giraph release-1.0, pulled Oct. 29, 2013.
Description
When I run my code with the out-of-core graph/message options OFF, it works fine. But when the out-of-core graph/message options are ON, some workers throw exceptions like the one below and the whole job stalls.
java.lang.IllegalStateException: run: Caught an unrecoverable exception waitFor: ExecutionException occurred while waiting for org.apache.giraph.utils.ProgressableUtils$FutureWaitable@3c7659ab
    at org.apache.giraph.graph.GraphMapper.run(GraphMapper.java:101)
    at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:763)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:369)
    at org.apache.hadoop.mapred.Child$4.run(Child.java:259)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:415)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1059)
    at org.apache.hadoop.mapred.Child.main(Child.java:253)
Caused by: java.lang.IllegalStateException: waitFor: ExecutionException occurred while waiting for org.apache.giraph.utils.ProgressableUtils$FutureWaitable@3c7659ab
    at org.apache.giraph.utils.ProgressableUtils.waitFor(ProgressableUtils.java:181)
    at org.apache.giraph.utils.ProgressableUtils.waitForever(ProgressableUtils.java:139)
    at org.apache.giraph.utils.ProgressableUtils.waitForever(ProgressableUtils.java:124)
    at org.apache.giraph.utils.ProgressableUtils.getFutureResult(ProgressableUtils.java:87)
    at org.apache.giraph.utils.ProgressableUtils.getResultsWithNCallables(ProgressableUtils.java:221)
    at org.apache.giraph.worker.BspServiceWorker.loadInputSplits(BspServiceWorker.java:283)
    at org.apache.giraph.worker.BspServiceWorker.loadVertices(BspServiceWorker.java:327)
    at org.apache.giraph.worker.BspServiceWorker.setup(BspServiceWorker.java:508)
    at org.apache.giraph.graph.GraphTaskManager.execute(GraphTaskManager.java:246)
    at org.apache.giraph.graph.GraphMapper.run(GraphMapper.java:91)
    ... 7 more
Caused by: java.util.concurrent.ExecutionException: java.lang.IllegalStateException: getOrCreatePartition: cannot retrieve partition 6
    at java.util.concurrent.FutureTask$Sync.innerGet(FutureTask.java:262)
    at java.util.concurrent.FutureTask.get(FutureTask.java:119)
    at org.apache.giraph.utils.ProgressableUtils$FutureWaitable.waitFor(ProgressableUtils.java:300)
    at org.apache.giraph.utils.ProgressableUtils.waitFor(ProgressableUtils.java:173)
    ... 16 more
Caused by: java.lang.IllegalStateException: getOrCreatePartition: cannot retrieve partition 6
    at org.apache.giraph.partition.DiskBackedPartitionStore.getOrCreatePartition(DiskBackedPartitionStore.java:243)
    at org.apache.giraph.comm.requests.SendWorkerVerticesRequest.doRequest(SendWorkerVerticesRequest.java:110)
    at org.apache.giraph.comm.netty.NettyWorkerClientRequestProcessor.doRequest(NettyWorkerClientRequestProcessor.java:482)
    at org.apache.giraph.comm.netty.NettyWorkerClientRequestProcessor.sendVertexRequest(NettyWorkerClientRequestProcessor.java:276)
    at org.apache.giraph.worker.VertexInputSplitsCallable.readInputSplit(VertexInputSplitsCallable.java:172)
    at org.apache.giraph.worker.InputSplitsCallable.loadInputSplit(InputSplitsCallable.java:267)
    at org.apache.giraph.worker.InputSplitsCallable.call(InputSplitsCallable.java:211)
    at org.apache.giraph.worker.InputSplitsCallable.call(InputSplitsCallable.java:60)
    at org.apache.giraph.utils.LogStacktraceCallable.call(LogStacktraceCallable.java:51)
    at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334)
    at java.util.concurrent.FutureTask.run(FutureTask.java:166)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
    at java.lang.Thread.run(Thread.java:722)
Caused by: java.util.concurrent.ExecutionException: java.lang.NullPointerException
    at java.util.concurrent.FutureTask$Sync.innerGet(FutureTask.java:252)
    at java.util.concurrent.FutureTask.get(FutureTask.java:111)
    at org.apache.giraph.partition.DiskBackedPartitionStore.getOrCreatePartition(DiskBackedPartitionStore.java:228)
    ... 13 more
Caused by: java.lang.NullPointerException
    at org.apache.giraph.partition.DiskBackedPartitionStore$GetPartition.call(DiskBackedPartitionStore.java:692)
    at org.apache.giraph.partition.DiskBackedPartitionStore$GetPartition.call(DiskBackedPartitionStore.java:658)
    at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334)
    at java.util.concurrent.FutureTask.run(FutureTask.java:166)
    at org.apache.giraph.partition.DiskBackedPartitionStore$DirectExecutorService.execute(DiskBackedPartitionStore.java:972)
    at java.util.concurrent.AbstractExecutorService.submit(AbstractExecutorService.java:132)
    ... 14 more
This exception occurs when superstep = -1, i.e., while the input splits are still being loaded.
Strangely, the job finishes fine (i) when I run it with 10 or fewer workers, or (ii) when I run one of the example programs in giraph-examples, in particular SimpleShortestPath, with 32 workers. The exceptions occur only when I run my own code with more than 10 workers; in that case the job fails as described above.
I found a similar problem, and as far as I can tell the very same one, reported earlier in GIRAPH-462, but that issue is marked as 'Resolved' and 'Fixed'. Has that issue really been fixed, or am I doing something wrong?
My input was 75 MB with about 1 million nodes, but I tested other inputs and found that the problem does not depend on the input size.
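For anyone trying to reproduce this, here is a minimal sketch of what "out-of-core graph/message options ON" might look like as job properties. It assumes the Giraph 1.0 property names giraph.useOutOfCoreGraph, giraph.maxPartitionsInMemory, giraph.useOutOfCoreMessages and giraph.maxMessagesInMemory. The class name, the helper methods and the numeric limits are hypothetical placeholders and are not necessarily the settings of the failing job.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class OutOfCoreFlags {
    // Hypothetical helper: collects the out-of-core properties in a fixed order.
    // The property names are assumed to match Giraph 1.0; the limits are placeholders.
    static Map<String, String> outOfCoreOptions(int maxPartitionsInMemory,
                                                int maxMessagesInMemory) {
        Map<String, String> opts = new LinkedHashMap<>();
        opts.put("giraph.useOutOfCoreGraph", "true");
        opts.put("giraph.maxPartitionsInMemory", Integer.toString(maxPartitionsInMemory));
        opts.put("giraph.useOutOfCoreMessages", "true");
        opts.put("giraph.maxMessagesInMemory", Integer.toString(maxMessagesInMemory));
        return opts;
    }

    // Renders the properties as -Dkey=value arguments separated by single spaces.
    static String toCommandLine(Map<String, String> opts) {
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<String, String> e : opts.entrySet()) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append("-D").append(e.getKey()).append('=').append(e.getValue());
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Placeholder limits: 10 partitions and 1,000,000 messages kept in memory.
        System.out.println(toCommandLine(outOfCoreOptions(10, 1000000)));
    }
}
```

Removing the two giraph.useOutOfCore* entries from the generated arguments corresponds to the "options OFF" case, in which the job completes.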