Spark's statistics tracking doesn't correctly assess the size of files uploaded through the magic committer: it only calls getFileStatus on the zero-byte marker objects, not on the yet-to-manifest files. Which, given they don't exist yet, isn't easy to do.
- Add getXAttr and listXAttr API calls to S3AFileSystem
- Return all S3 object headers, both custom and standard, as XAttr attributes prefixed "header." (e.g. header.Content-Length).
The setXAttr call isn't implemented, so for correctness the FS doesn't
declare its support for the API in hasPathCapability().
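As a sketch of how a client might consume the new API: getXAttrs() returns header values as byte arrays, which the caller decodes. The class, the map stand-in for a live S3AFileSystem, and the assumption that values decode as UTF-8 strings are all illustrative, not part of the patch.

```java
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

// Sketch: decoding the byte[] XAttr values that a call such as
// fs.getXAttrs(path) might return. The "header." prefix is from the
// patch description; UTF-8 string values are an assumption.
public class XAttrHeaders {

  /** Decode a single XAttr value as a UTF-8 string, or null if absent. */
  public static String decode(Map<String, byte[]> xattrs, String name) {
    byte[] raw = xattrs.get(name);
    return raw == null ? null : new String(raw, StandardCharsets.UTF_8);
  }

  public static void main(String[] args) {
    // Stand-in for the result of getXAttrs() against a real object store.
    Map<String, byte[]> xattrs = new LinkedHashMap<>();
    xattrs.put("header.Content-Length",
        "0".getBytes(StandardCharsets.UTF_8));
    xattrs.put("header.Content-Type",
        "application/octet-stream".getBytes(StandardCharsets.UTF_8));

    System.out.println(decode(xattrs, "header.Content-Length"));
  }
}
```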
When a file is written through the magic committer, the custom header
x-hadoop-s3a-magic-data-length is set on the zero-byte marker file,
recording the length of the final data.
A matching patch in Spark will look for the XAttr
"header.x-hadoop-s3a-magic-data-length" whenever the file
being probed for output data is zero bytes long.
As a result, the job tracking statistics will report the
bytes which have been written but are yet to manifest.
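The Spark-side probe described above could be sketched as follows. Only the XAttr name "header.x-hadoop-s3a-magic-data-length" comes from this patch; the helper class, its fallback behaviour, and the assumption that the value is a decimal string are hypothetical.

```java
import java.nio.charset.StandardCharsets;
import java.util.Map;

// Sketch of the Spark-side probe: when an output file reports zero
// bytes, fall back to the magic-committer length header.
public class MagicLengthProbe {

  public static final String XATTR_MAGIC_LENGTH =
      "header.x-hadoop-s3a-magic-data-length";

  /**
   * Return the bytes written for a file: the reported length if
   * non-zero, otherwise the value of the magic-data-length XAttr
   * (assumed to be a decimal string), or 0 if absent/unparseable.
   */
  public static long bytesWritten(long reportedLength,
      Map<String, byte[]> xattrs) {
    if (reportedLength > 0) {
      return reportedLength;
    }
    byte[] raw = xattrs.get(XATTR_MAGIC_LENGTH);
    if (raw == null) {
      return 0;
    }
    try {
      return Long.parseLong(new String(raw, StandardCharsets.UTF_8).trim());
    } catch (NumberFormatException e) {
      return 0;
    }
  }
}
```

The key design point is that the probe is a cheap fallback: a normally manifested file keeps its getFileStatus length, and only the zero-byte marker triggers the extra XAttr lookup.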