Details
Type: Bug
Status: Resolved
Priority: Major
Resolution: Fixed
Description
Initial runs of the distributed paramserv implementation on a small cluster show that it works correctly but exhibits large overheads. Below are the statistics for MNIST LeNet, 10 epochs, ASP, update per EPOCH, on a cluster of 1+6 nodes (24 cores per worker node).
Total elapsed time: 687.743 sec.
Total compilation time: 3.815 sec.
Total execution time: 683.928 sec.
Number of compiled Spark inst: 330.
Number of executed Spark inst: 0.
Cache hits (Mem, WB, FS, HDFS): 176210/0/0/2.
Cache writes (WB, FS, HDFS): 29856/5271/0.
Cache times (ACQr/m, RLS, EXP): 1.178/0.087/198.892/0.000 sec.
HOP DAGs recompiled (PRED, SB): 0/1629.
HOP DAGs recompile time: 4.878 sec.
Functions recompiled: 1.
Functions recompile time: 0.097 sec.
Spark ctx create time (lazy): 22.222 sec.
Spark trans counts (par,bc,col): 2/1/0.
Spark trans times (par,bc,col): 0.390/0.242/0.000 secs.
Paramserv total num workers: 144.
Paramserv setup time: 68.259 secs.
Paramserv grad compute time: 6952.163 secs.
Paramserv model update time: 2453.448/422.955 secs.
Paramserv model broadcast time: 24.982 secs.
Paramserv batch slice time: 0.204 secs.
Paramserv RPC request time: 51611.210 secs.
ParFor loops optimized: 1.
ParFor optimize time: 0.462 sec.
ParFor initialize time: 0.049 sec.
ParFor result merge time: 0.028 sec.
ParFor total update in-place: 0/188/188
Total JIT compile time: 98.786 sec.
Total JVM GC count: 68.
Total JVM GC time: 25.858 sec.
Heavy hitter instructions:
 #  Instruction      Time(s)  Count
 1  paramserv        665.479      1
 2  +                182.410  18636
 3  conv2d_bias_add  150.938    376
 4  sqrt              69.768  11528
 5  /                 54.836  11732
 6  ba+*              45.901    376
 7  *                 38.046  11727
 8  -                 37.428  12096
 9  ^2                35.533   6344
10  exp               21.022    188
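To put the paramserv timers in perspective, the following rough calculation divides two of them by the worker count. It assumes that each of these timers is accumulated over all 144 workers. The statistics do not state this, but it would explain why the values exceed the total elapsed time. The variable names are only illustrative.

```python
# Figures copied from the statistics above.
num_workers = 144
execution_time = 683.928  # total execution time in seconds

# Assumption: each paramserv timer is a sum over all workers.
accumulated = {
    "grad compute": 6952.163,
    "RPC request": 51611.210,
}

# Average time per worker under that assumption.
per_worker = {name: total / num_workers for name, total in accumulated.items()}

for name, secs in per_worker.items():
    share = secs / execution_time
    print(f"{name}: {secs:.1f} s per worker ({share:.0%} of execution time)")
```

Under that assumption, an average worker spends roughly 358 s in RPC requests but only about 48 s computing gradients, so RPC accounts for about half of the execution time.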
There seem to be three distinct issues:
- Too many tasks when assembling the distributed input data (on the order of the number of rows, i.e., >50,000 tasks), which makes the distributed data partitioning very slow (multiple minutes).
- Evictions from the buffer pool at the driver node (see cache writes). This is likely due to disabling cleanup (and missing explicit cleanup) of all RPC objects.
- Large RPC overhead: This might be due to the evictions happening in the critical path while all 144 workers wait on their RPC requests. In addition, we should double-check that the number of RPC handler threads is correct, see whether we can move serialization and communication out of the critical (i.e., synchronized) path of model updates, and address unnecessary serialization/deserialization overheads.
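The idea of moving serialization out of the synchronized path can be sketched as follows. This is a minimal illustration and not the actual paramserv code: the class `ParamServerSketch`, its `push`/`pull` methods, the use of `pickle`, and the plain subtraction used as the update rule are all hypothetical stand-ins.

```python
import pickle
import threading


class ParamServerSketch:
    """Hypothetical parameter server; not the actual paramserv implementation."""

    def __init__(self, model):
        self._model = list(model)
        self._lock = threading.Lock()

    def push(self, payload: bytes) -> None:
        # Deserialize outside the lock, so workers can do this concurrently.
        gradients = pickle.loads(payload)
        with self._lock:
            # Only the in-memory model update is synchronized.
            self._model = [w - g for w, g in zip(self._model, gradients)]

    def pull(self) -> bytes:
        with self._lock:
            # Take a shallow copy of the model under the lock ...
            snapshot = list(self._model)
        # ... and serialize it after the lock has been released.
        return pickle.dumps(snapshot)


# Four workers push the same gradient concurrently.
server = ParamServerSketch([1.0, 2.0])
threads = [
    threading.Thread(target=server.push, args=(pickle.dumps([0.25, 0.5]),))
    for _ in range(4)
]
for t in threads:
    t.start()
for t in threads:
    t.join()
final_model = pickle.loads(server.pull())
```

With this structure, a worker that is serializing or deserializing does not block other workers from entering the update; only the short in-memory update itself is serialized by the lock.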
Guobao, I'll help reduce the serialization/deserialization overheads, but it would be great if you could look into the other issues.