[SPARK-22605] OutputMetrics empty for DataFrame writes - ASF JIRA

Details

Type: Bug
Status: Resolved
Priority: Minor
Resolution: Fixed
Affects Version/s: 2.2.0
Fix Version/s: 2.3.0
Component/s: SQL
Labels: None

Description

I am trying to use the SparkListener interface to hook up custom monitoring for some of our critical jobs. Among the first metrics I would like are the output row count and output size. I'm using PySpark and the Py4J interface to implement the listener.

I am able to see the recordsRead and bytesRead metrics via the taskEnd.taskMetrics().inputMetrics().recordsRead() and .bytesRead() methods. However, taskEnd.taskMetrics().outputMetrics().recordsWritten() and .bytesWritten() are always 0. I see the same results if I use the stageCompleted event instead.
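For reference, a minimal sketch of the kind of listener described above. The class name OutputMetricsListener and its two counter attributes are hypothetical; only the taskMetrics(), outputMetrics(), recordsWritten() and bytesWritten() calls come from this report. A real Py4J listener would also have to provide the remaining SparkListenerInterface callbacks, which are omitted here. The registration steps in the trailing comments rely on private PySpark attributes and are an assumption, not a documented API.

```python
class OutputMetricsListener:
    """Accumulates output metrics reported by task-end events (hypothetical name)."""

    def __init__(self):
        self.records_written = 0
        self.bytes_written = 0

    def onTaskEnd(self, taskEnd):
        # taskEnd is the SparkListenerTaskEnd event passed in through Py4J.
        metrics = taskEnd.taskMetrics()
        if metrics is None:
            # Failed tasks may not carry any metrics; skip them.
            return
        output = metrics.outputMetrics()
        self.records_written += output.recordsWritten()
        self.bytes_written += output.bytesWritten()

    class Java:
        # Py4J convention for declaring which Java interface this object implements.
        implements = ["org.apache.spark.scheduler.SparkListenerInterface"]


# Assumed registration against a live SparkContext `sc` (private attributes):
#   sc._gateway.start_callback_server()
#   sc._jsc.sc().addSparkListener(OutputMetricsListener())
```

With a listener of this shape, the behaviour reported here would show up as both counters staying at 0 after df.write.parquet(path).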

To trigger execution, I am using df.write.parquet(path). If I use df.rdd.saveAsTextFile(path) instead, the counts and bytes are correct.

Another clue that this bug is deeper in Spark SQL is that the Spark Application Master doesn't show the Output Size / Records column with df.write.parquet or df.write.text, but does with df.rdd.saveAsTextFile. Since the Spark Application Master also gets its output via the Listener interface, this would seem related.

There is a related issue, https://issues.apache.org/jira/browse/SPARK-21882, but I believe this to be a distinct issue.

Issue Links

links to

[Github] Pull Request #19833 (cloud-fan)

People

Assignee: Wenchen Fan

Reporter: Jason White

Votes: 0

Watchers: 6

Dates

Created: 24/Nov/17 22:42

Updated: 14/Feb/22 08:42

Resolved: 29/Nov/17 11:19