[SYSTEMDS-2469] Large distributed paramserv overheads - ASF JIRA

XML

Word

Printable

JSON

Details

Type: Bug
Status: Resolved
Priority: Major
Resolution: Fixed
Affects Version/s: None
Fix Version/s: SystemML 1.2
Component/s: None
Labels:
None

Epic Link:
Language and runtime for parameter servers

Description

Initial runs with the distributed paramserv implementation on a small cluster revealed that it is working correctly while exhibiting large overheads. Below are the stats for mnist lenet, 10 epochs, ASP, update per EPOCH, on a cluster of 1+6 nodes (24 cores per worker node).

otal elapsed time:             687.743 sec.
Total compilation time:         3.815 sec.
Total execution time:           683.928 sec.
Number of compiled Spark inst:  330.
Number of executed Spark inst:  0.
Cache hits (Mem, WB, FS, HDFS): 176210/0/0/2.
Cache writes (WB, FS, HDFS):    29856/5271/0.
Cache times (ACQr/m, RLS, EXP): 1.178/0.087/198.892/0.000 sec.
HOP DAGs recompiled (PRED, SB): 0/1629.
HOP DAGs recompile time:        4.878 sec.
Functions recompiled:           1.
Functions recompile time:       0.097 sec.
Spark ctx create time (lazy):   22.222 sec.
Spark trans counts (par,bc,col):2/1/0.
Spark trans times (par,bc,col): 0.390/0.242/0.000 secs.
Paramserv total num workers:    144.
Paramserv setup time:           68.259 secs.
Paramserv grad compute time:    6952.163 secs.
Paramserv model update time:    2453.448/422.955 secs.
Paramserv model broadcast time: 24.982 secs.
Paramserv batch slice time:     0.204 secs.
Paramserv RPC request time:     51611.210 secs.
ParFor loops optimized:         1.
ParFor optimize time:           0.462 sec.
ParFor initialize time:         0.049 sec.
ParFor result merge time:       0.028 sec.
ParFor total update in-place:   0/188/188
Total JIT compile time:         98.786 sec.
Total JVM GC count:             68.
Total JVM GC time:              25.858 sec.
Heavy hitter instructions:
  #  Instruction      Time(s)  Count
  1  paramserv        665.479      1
  2  +                182.410  18636
  3  conv2d_bias_add  150.938    376
  4  sqrt              69.768  11528
  5  /                 54.836  11732
  6  ba+*              45.901    376
  7  *                 38.046  11727
  8  -                 37.428  12096
  9  ^2                35.533   6344
 10  exp               21.022    188

There seem to be three distinct issues:

Too larger number of tasks on assembling the distributed input data (in the number of rows, i.e., >50,000 tasks), which makes the distributed data partitioning very slow (multiple minutes).
Evictions from the buffer pool at the driver node (see cache writes). This is likely due to disabling cleanup (and missing explicit cleanup) of all RPC objects.
Large RPC overhead: This might be due to the evictions happening in the critical path and all 144 workers waiting with their RPC requests. However, in addition we should also double check that the number of RPC handler threads is correct, if we could get the serialization and communication out of the critical (i.e., synchronized) path of model updates, and address unnecessary serialization/deserialization overheads.

Guobao I'll help reducing the serialization/deserialization overheads, but it would be great if you could have a look into the other issues.

Attachments

Activity

People

Assignee:: LI Guobao

Reporter:: Matthias Boehm

Votes:: 0 Vote for this issue

Watchers:: 1 Start watching this issue

Dates

Created:: 28/Jul/18 00:35

Updated:: 31/Jul/18 20:11

Resolved:: 31/Jul/18 20:11