Apache Celeborn / CELEBORN-1713

RpcTimeoutException should include RPC address in message


Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 0.6.0
    • Fix Version/s: 0.6.0

    Description

      The message of `RpcTimeoutException` currently does not contain the RPC address, which makes a timeout hard to troubleshoot because the address of the endpoint that failed to respond is unknown.

      24/11/12 03:00:51 [Executor task launch worker for task 53432.0 in stage 0.0 (TID 53487)] ERROR Executor: Exception in task 53432.0 in stage 0.0 (TID 53487)
      org.apache.celeborn.common.rpc.RpcTimeoutException: Futures timed out after [120000 milliseconds]. This timeout is controlled by celeborn.rpc.lookupTimeout
      	at org.apache.celeborn.common.rpc.RpcTimeout.org$apache$celeborn$common$rpc$RpcTimeout$$createRpcTimeoutException(RpcTimeout.scala:46)
      	at org.apache.celeborn.common.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:61)
      	at org.apache.celeborn.common.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:57)
      	at scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:38)
      	at org.apache.celeborn.common.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:75)
      	at org.apache.celeborn.common.rpc.RpcEnv.setupEndpointRefByAddr(RpcEnv.scala:106)
      	at org.apache.celeborn.common.rpc.RpcEnv.setupEndpointRef(RpcEnv.scala:114)
      	at org.apache.celeborn.client.ShuffleClientImpl.setupLifecycleManagerRef(ShuffleClientImpl.java:1759)
      	at org.apache.celeborn.client.ShuffleClient.get(ShuffleClient.java:89)
      	at org.apache.spark.shuffle.celeborn.SparkShuffleManager.getWriter(SparkShuffleManager.java:239)
      	at org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:57)
      	at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:100)
      	at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:52)
      	at org.apache.spark.scheduler.Task.run(Task.scala:144)
      	at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:598)
      	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1545)
      	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:603)
      	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
      	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
      	at java.lang.Thread.run(Thread.java:748)
      Caused by: java.util.concurrent.TimeoutException: Futures timed out after [120000 milliseconds]
      	at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:259)
      	at scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:263)
      	at org.apache.celeborn.common.util.ThreadUtils$.awaitResult(ThreadUtils.scala:316)
      	at org.apache.celeborn.common.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:74)
      	... 15 more
      

      Therefore, `RpcTimeoutException` should include the RPC address in its message to help troubleshoot timeouts.
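
      The proposal can be illustrated with a minimal sketch. It is not Celeborn's implementation (the real `RpcTimeout` is written in Scala); the class `RpcTimeoutSketch`, the exception `RpcTimeoutWithAddressException`, the helper names, the message wording, and the address `celeborn-host:9097` are all hypothetical. It only shows the idea of wrapping the underlying `TimeoutException` in a message that names both the RPC address and the controlling config key.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class RpcTimeoutSketch {

  /** Hypothetical stand-in for RpcTimeoutException; not the real Celeborn class. */
  public static class RpcTimeoutWithAddressException extends RuntimeException {
    public RpcTimeoutWithAddressException(String message, Throwable cause) {
      super(message, cause);
    }
  }

  /** Builds a timeout message that names the RPC address (assumed wording). */
  public static String buildMessage(String address, long timeoutMs, String timeoutProp) {
    return "Futures timed out after [" + timeoutMs + " milliseconds] for RPC address "
        + address + ". This timeout is controlled by " + timeoutProp;
  }

  /** Waits for the future and rethrows a timeout with the address attached. */
  public static <T> T awaitResult(
      CompletableFuture<T> future, String address, long timeoutMs, String timeoutProp) {
    try {
      return future.get(timeoutMs, TimeUnit.MILLISECONDS);
    } catch (TimeoutException e) {
      // Keep the original exception as the cause so the stack trace is preserved.
      throw new RpcTimeoutWithAddressException(
          buildMessage(address, timeoutMs, timeoutProp), e);
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      throw new RuntimeException(e);
    } catch (ExecutionException e) {
      throw new RuntimeException(e.getCause());
    }
  }

  public static void main(String[] args) {
    // A future that never completes simulates an unresponsive endpoint.
    CompletableFuture<String> neverCompletes = new CompletableFuture<>();
    try {
      awaitResult(neverCompletes, "celeborn-host:9097", 50L, "celeborn.rpc.lookupTimeout");
    } catch (RpcTimeoutWithAddressException e) {
      System.out.println(e.getMessage());
    }
  }
}
```

      With a message of this shape, the executor log alone would show which endpoint did not answer within `celeborn.rpc.lookupTimeout`.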

            People

              Assignee: Nicholas Jiang (nicholasjiang)
              Reporter: Nicholas Jiang (nicholasjiang)

                Time Tracking

                  Original Estimate: Not Specified
                  Remaining Estimate: 0h
                  Time Spent: 0.5h