Details
Description
When sparkR is run at log level INFO, a summary of how the worker spent its time processing the partition is printed. There is a logic error where it is over-reporting the time inputting rows.
In detail: the variable inputElap in a wider context is used to mark the beginning of reading rows, but in the part changed here it was used as a local variable for measuring compute time. Thus, the error is not observable if there is only one group per partition, which is what you get in unit tests.
For our application, here's what a log entry looks like before these changes were applied:
20/10/09 04:08:58 WARN RRunner: Times: boot = 0.013 s, init = 0.005 s, broadcast = 0.000 s, read-input = 529.471 s, compute = 492.037 s, write-output = 0.020 s, total = 1021.546 s
this indicates that we're spending more time reading rows than operating on the rows.
After these changes, it looks like this:
20/12/15 06:43:29 WARN RRunner: Times: boot = 0.013 s, init = 0.010 s, broadcast = 0.000 s, read-input = 120.275 s, compute = 1680.161 s, write-output = 0.045 s, total = 1812.553 s