Description
In the SQLShuffleWriteMetricsReporter.decRecordsWritten method, the call delegated to the wrapped metricsReporter accidentally decrements bytesWritten instead of recordsWritten:
override def decRecordsWritten(v: Long): Unit = {
  metricsReporter.decBytesWritten(v)
  _recordsWritten.set(_recordsWritten.value - v)
}
One of the situations where decRecordsWritten is called is when reverting shuffle writes from failed or canceled tasks. Because of the mix-up in these calls, the recordsWritten metric ends up v too high (it was never decremented) and the bytesWritten metric ends up v too low (a record count was subtracted from a byte count), causing some failed tasks' write metrics to look like
{"Shuffle Bytes Written":-2109,"Shuffle Write Time":2923270,"Shuffle Records Written":2109}
instead of
{"Shuffle Bytes Written":0,"Shuffle Write Time":2923270,"Shuffle Records Written":0}
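For illustration, here is a minimal, self-contained sketch of the intended behavior. The trait and class names (WriteMetricsReporter, TaskWriteMetrics, SqlWriteMetricsReporter, RevertDemo) and the byte count used are hypothetical stand-ins, not Spark's actual classes or data; the only point is that decRecordsWritten should delegate to decRecordsWritten on the wrapped reporter. The actual fix may look different.

```scala
// Hypothetical stand-in for the reporter interface; not Spark's real trait.
trait WriteMetricsReporter {
  def incBytesWritten(v: Long): Unit
  def incRecordsWritten(v: Long): Unit
  def decBytesWritten(v: Long): Unit
  def decRecordsWritten(v: Long): Unit
}

// Task-level counters, playing the role of the wrapped metricsReporter.
final class TaskWriteMetrics extends WriteMetricsReporter {
  var bytesWritten: Long = 0L
  var recordsWritten: Long = 0L
  override def incBytesWritten(v: Long): Unit = bytesWritten += v
  override def incRecordsWritten(v: Long): Unit = recordsWritten += v
  override def decBytesWritten(v: Long): Unit = bytesWritten -= v
  override def decRecordsWritten(v: Long): Unit = recordsWritten -= v
}

// SQL-level reporter that mirrors every update into the wrapped reporter.
final class SqlWriteMetricsReporter(metricsReporter: WriteMetricsReporter)
    extends WriteMetricsReporter {
  private var _bytesWritten: Long = 0L
  private var _recordsWritten: Long = 0L

  override def incBytesWritten(v: Long): Unit = {
    metricsReporter.incBytesWritten(v)
    _bytesWritten += v
  }
  override def incRecordsWritten(v: Long): Unit = {
    metricsReporter.incRecordsWritten(v)
    _recordsWritten += v
  }
  override def decBytesWritten(v: Long): Unit = {
    metricsReporter.decBytesWritten(v)
    _bytesWritten -= v
  }
  override def decRecordsWritten(v: Long): Unit = {
    // Delegate to decRecordsWritten, not decBytesWritten.
    metricsReporter.decRecordsWritten(v)
    _recordsWritten -= v
  }
}

object RevertDemo {
  // Simulates a task that writes some records and then reverts them.
  // Returns the task-level (bytesWritten, recordsWritten) after the revert.
  def run(): (Long, Long) = {
    val task = new TaskWriteMetrics
    val sql = new SqlWriteMetricsReporter(task)
    sql.incBytesWritten(50000L) // arbitrary byte count, for illustration only
    sql.incRecordsWritten(2109L)
    sql.decBytesWritten(50000L)
    sql.decRecordsWritten(2109L)
    (task.bytesWritten, task.recordsWritten)
  }
}
```

With the delegation corrected, RevertDemo.run() leaves both task-level counters at 0 after the revert. Swapping the delegate back to decBytesWritten in this sketch leaves bytesWritten at -2109 and recordsWritten at 2109, the same pattern as the failed-task metrics shown above.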
I'll submit a fix for this.