[SPARK-21669] Internal API for collecting metrics/stats during FileFormatWriter jobs - ASF JIRA

Details

Type: Improvement
Status: Resolved
Priority: Major
Resolution: Fixed
Affects Version/s: 2.3.0
Fix Version/s: 2.3.0
Component/s: SQL
Labels: None

Description

It would be useful to have some infrastructure in place for collecting custom metrics or statistics on data on the fly, as it is being written to disk.

This was inspired by the work in SPARK-20703, which added simple metrics collection for data write operations, such as numFiles, numPartitions, numRows. Those metrics are first collected on the executors and then sent to the driver, which aggregates and posts them as updates to the SQLMetrics subsystem.

The above can be generalized and turned into a pluggable interface, which in the future could be used for other purposes: e.g. automatic maintenance of cost-based optimizer (CBO) statistics during "INSERT INTO <table> SELECT ..." operations, so that users no longer need to explicitly call "ANALYZE TABLE <table> COMPUTE STATISTICS" afterwards, thus avoiding an extra full-table scan.
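To make the proposal concrete, here is a minimal sketch of what such a pluggable interface could look like. It is written in plain Python for illustration only and is not Spark's API: the names TaskStatsTracker, JobStatsTracker, BasicStats and simulate_write_task are all hypothetical, and the write task is simulated with in-memory data, assuming one output file per partition. The idea shown is the two-level split described above: a per-task tracker observes events on an executor, and a job-level tracker on the driver aggregates the per-task results.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import Any, Dict, List


class TaskStatsTracker(ABC):
    """Hypothetical executor-side hook: observes a single write task."""

    @abstractmethod
    def new_partition(self, partition: str) -> None: ...

    @abstractmethod
    def new_file(self, path: str) -> None: ...

    @abstractmethod
    def new_row(self, row: Any) -> None: ...

    @abstractmethod
    def final_stats(self) -> Any: ...


class JobStatsTracker(ABC):
    """Hypothetical driver-side hook: creates task trackers, merges results."""

    @abstractmethod
    def new_task_tracker(self) -> TaskStatsTracker: ...

    @abstractmethod
    def process_stats(self, stats: List[Any]) -> Any: ...


@dataclass
class BasicStats:
    num_files: int = 0
    num_partitions: int = 0
    num_rows: int = 0


class BasicTaskTracker(TaskStatsTracker):
    """Counts files, partitions and rows seen by one task."""

    def __init__(self) -> None:
        self._stats = BasicStats()

    def new_partition(self, partition: str) -> None:
        self._stats.num_partitions += 1

    def new_file(self, path: str) -> None:
        self._stats.num_files += 1

    def new_row(self, row: Any) -> None:
        self._stats.num_rows += 1

    def final_stats(self) -> BasicStats:
        return self._stats


class BasicJobTracker(JobStatsTracker):
    def new_task_tracker(self) -> TaskStatsTracker:
        return BasicTaskTracker()

    def process_stats(self, stats: List[BasicStats]) -> BasicStats:
        # Simple sum; a partition written by several tasks is counted once
        # per task here, which a real implementation may want to deduplicate.
        total = BasicStats()
        for s in stats:
            total.num_files += s.num_files
            total.num_partitions += s.num_partitions
            total.num_rows += s.num_rows
        return total


def simulate_write_task(tracker: TaskStatsTracker,
                        partitions: Dict[str, List[Any]]) -> Any:
    """Stand-in for a write task: one file per partition (an assumption)."""
    for name, rows in partitions.items():
        tracker.new_partition(name)
        tracker.new_file(f"{name}/part-0")
        for row in rows:
            tracker.new_row(row)
    return tracker.final_stats()


# Two simulated tasks, then driver-side aggregation.
job = BasicJobTracker()
task_inputs = [{"a": [1, 2], "b": [3]}, {"c": [4, 5, 6]}]
per_task = [simulate_write_task(job.new_task_tracker(), p) for p in task_inputs]
totals = job.process_stats(per_task)
print(totals)  # BasicStats(num_files=3, num_partitions=3, num_rows=6)
```

Under this shape, a different use case such as maintaining CBO statistics would only need another pair of tracker implementations; the code that drives the write would not change.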

Issue Links

is related to: SPARK-21762 FileFormatWriter/BasicWriteTaskStatsTracker metrics collection fails if a new file isn't yet visible (Resolved)

links to: GitHub

People

Assignee: Adrian Ionescu

Reporter: Adrian Ionescu

Votes: 0

Watchers: 2

Dates

Created: 08/Aug/17 13:51

Updated: 17/Aug/17 19:41

Resolved: 10/Aug/17 19:36