SystemDS / SYSTEMDS-2478

Overhead when using parfor in update func

Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved

    Description

      When parfor is used inside the update function, MR tasks are launched to write the task outputs, and the paramserv run takes longer than the same run without parfor in the update function. The scenario is to launch the ASP Epoch DC spark paramserv test.
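
      For context, the test invokes paramserv roughly as sketched below. This is a minimal sketch, not the exact test code: the variable names, epochs, and batch size are placeholders; the ASP utype, EPOCH frequency, DISJOINT_CONTIGUOUS (DC) scheme, Spark mode, and k=3 workers follow from the scenario and the statistics below.

      # Minimal sketch of the scenario (placeholder names and sizes):
      # asynchronous updates (ASP), per-epoch aggregation, disjoint-
      # contiguous (DC) data partitioning, run on 3 remote Spark workers.
      modelList2 = paramserv(model=modelList, features=X, labels=Y,
          upd="gradients", agg="aggregation", mode="REMOTE_SPARK",
          utype="ASP", freq="EPOCH", epochs=10, batchsize=64,
          k=3, scheme="DISJOINT_CONTIGUOUS", hyperparams=params)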
      Here is the statistics output:

      Total elapsed time:		101.804 sec.
      Total compilation time:		3.690 sec.
      Total execution time:		98.114 sec.
      Number of compiled Spark inst:	302.
      Number of executed Spark inst:	540.
      Cache hits (Mem, WB, FS, HDFS):	57839/0/0/240.
      Cache writes (WB, FS, HDFS):	14567/58/61.
      Cache times (ACQr/m, RLS, EXP):	42.346/0.064/4.761/20.280 sec.
      HOP DAGs recompiled (PRED, SB):	0/144.
      HOP DAGs recompile time:	0.507 sec.
      Functions recompiled:		16.
      Functions recompile time:	0.064 sec.
      Spark ctx create time (lazy):	1.376 sec.
      Spark trans counts (par,bc,col):270/1/240.
      Spark trans times (par,bc,col):	0.573/0.197/42.255 secs.
      Paramserv total num workers:	3.
      Paramserv setup time:		1.559 secs.
      Paramserv grad compute time:	105.701 secs.
      Paramserv model update time:	56.801/47.193 secs.
      Paramserv model broadcast time:	23.872 secs.
      Paramserv batch slice time:	0.000 secs.
      Paramserv RPC request time:	105.159 secs.
      ParFor loops optimized:		1.
      ParFor optimize time:		0.040 sec.
      ParFor initialize time:		0.434 sec.
      ParFor result merge time:	0.005 sec.
      ParFor total update in-place:	0/7/7
      Total JIT compile time:		68.384 sec.
      Total JVM GC count:		1120.
      Total JVM GC time:		22.338 sec.
      Heavy hitter instructions:
        #  Instruction             Time(s)  Count
        1  paramserv                97.221      1
        2  conv2d_bias_add          60.581    614
        3  *                        54.990  12447
        4  sp_-                     20.625    240
        5  -                        17.979   7287
        6  +                        14.191  12824
        7  r'                        5.636   1200
        8  conv2d_backward_filter    5.123    600
        9  max                       4.985    907
       10  ba+*                      4.591   1814
      
      

      Here is the cleaned-up update function:

      # note: assumes the script sources the nn library, e.g.
      # source("nn/optim/sgd_nesterov.dml") as sgd_nesterov
      aggregation = function(list[unknown] model,
                             list[unknown] gradients,
                             list[unknown] hyperparams)
         return (list[unknown] modelResult) {
           lr = as.double(as.scalar(hyperparams["lr"]))
           mu = as.double(as.scalar(hyperparams["mu"]))
      
           modelResult = model
      
           # Optimize with SGD w/ Nesterov momentum
           parfor(i in 1:8, check=0) {
             P = as.matrix(model[i])
             dP = as.matrix(gradients[i])
             vP = as.matrix(model[8+i])
             [P, vP] = sgd_nesterov::update(P, dP, lr, mu, vP)
             modelResult[i] = P
             modelResult[8+i] = vP
           }
         }
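
      For comparison, the same update logic without parfor (the configuration that runs faster) is simply a sequential loop over the same body:

      # Sequential variant of the parfor loop above: identical body,
      # but no parfor task creation or result handling is involved.
      for (i in 1:8) {
        P = as.matrix(model[i])
        dP = as.matrix(gradients[i])
        vP = as.matrix(model[8+i])
        [P, vP] = sgd_nesterov::update(P, dP, lr, mu, vP)
        modelResult[i] = P
        modelResult[8+i] = vP
      }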
      

      mboehm7, in fact I have no idea where this overhead comes from. It seems that the parfor task outputs are written to HDFS. Is that the normal behavior?

          People

            Assignee: Unassigned
            Reporter: Guobao LI
