Spark / SPARK-34015

SparkR partition timing summary reports input time correctly



    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 2.3.2, 3.0.1
    • Fix Version/s: 3.1.1, 3.2.0
    • Component/s: SparkR
    • Labels: None
    • Environment: Observed on CentOS 7 running Spark 2.3.1 and on my Mac running master
    • Flags: Patch


      When SparkR is run at log level INFO, a summary of how the worker spent its time processing the partition is printed. There is a logic error that over-reports the time spent reading input rows.

      In detail: the variable inputElap is used in a wider context to mark the beginning of reading rows, but in the part changed here it was also used as a local variable for measuring compute time. As a result, the error is not observable when there is only one group per partition, which is exactly what unit tests exercise.
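      The pattern can be sketched as follows. This is hypothetical Python, not SparkR's actual R worker code; the names (process_partition, input_elap, clock) are illustrative only. The point is that one timestamp variable does double duty as the read-start marker and the compute-start marker, so with multiple groups per partition each group's compute time leaks into the next group's read-input measurement:

```python
import time

def process_partition(groups, compute, clock=time.monotonic):
    """Sketch of the timing bug: input_elap is meant to mark the start
    of reading rows, but the loop reuses it as a scratch compute-start
    marker and never resets it after a group's compute finishes. Each
    subsequent group's read time is therefore measured from the end of
    the *previous* group's read, so the previous group's compute time
    is misattributed to read-input. With a single group the numbers
    come out correct, which is why unit tests did not catch it."""
    read_time = 0.0
    compute_time = 0.0
    input_elap = clock()                     # start of reading the partition
    for rows in groups:
        read_done = clock()
        read_time += read_done - input_elap  # inflated for every group but the first
        input_elap = read_done               # BUG: reused as the compute-start marker
        compute(rows)
        compute_time += clock() - input_elap
        # Fix sketch: time the compute with a separate variable and reset
        # input_elap here, after compute, so the next group's read is
        # measured from this point.
    return read_time, compute_time
```

      Driving this with a fake clock in which reads are instantaneous and each of three groups computes for 10 s reports read-input = 20 s, even though no time at all was spent reading: the first two groups' compute time is counted twice, once under compute and once under the next read.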

      For our application, here is what a log entry looked like before these changes were applied:

      20/10/09 04:08:58 WARN RRunner: Times: boot = 0.013 s, init = 0.005 s, broadcast = 0.000 s, read-input = 529.471 s, compute = 492.037 s, write-output = 0.020 s, total = 1021.546 s

      This indicates that more time was spent reading rows than operating on them.

      After these changes, it looks like this:

      20/12/15 06:43:29 WARN RRunner: Times: boot = 0.013 s, init = 0.010 s, broadcast = 0.000 s, read-input = 120.275 s, compute = 1680.161 s, write-output = 0.045 s, total = 1812.553 s




            WamBamBoozle Tom Howland
            hyukjin.kwon hyukjin.kwon
            Votes: 0, Watchers: 2