  1. Hadoop Common
  2. HADOOP-16829 Über-jira: S3A Hadoop 3.3.1 features
  3. HADOOP-17414

Magic committer files don't have the count of bytes written collected by Spark

    Details

    • Type: Sub-task
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 3.2.0
    • Fix Version/s: 3.3.1
    • Component/s: fs/s3
    • Target Version/s:

      Description

      Spark's statistics tracking doesn't correctly assess the size of files uploaded through the magic committer: it only calls getFileStatus on the zero-byte marker objects, not on the yet-to-manifest files. Since those files don't exist yet, that isn't easy to do.

      Solution:

      • Add getXAttr and listXAttr API calls to S3AFileSystem
      • Return all S3 object headers, both custom and standard, as XAttr attributes prefixed with "header." (e.g. header.Content-Length).

      The setXAttr call isn't implemented, so for correctness the FS doesn't
      declare its support for the API in hasPathCapability().
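The "header." naming convention above can be sketched in plain Java. This is a minimal illustration, not the S3A implementation: the class `HeaderXAttrs` and its methods are hypothetical, the input map stands in for attributes already fetched through the XAttr calls, and decoding values as UTF-8 text is an assumption.

```java
import java.nio.charset.StandardCharsets;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper, not part of Hadoop: models the "header." XAttr naming.
final class HeaderXAttrs {
    /** Prefix the issue describes for S3 object headers exposed as XAttrs. */
    static final String XA_HEADER_PREFIX = "header.";

    /** Build the XAttr name for a header, e.g. "header.Content-Length". */
    static String xattrName(String header) {
        return XA_HEADER_PREFIX + header;
    }

    /**
     * Keep only the "header."-prefixed attributes, strip the prefix,
     * and decode each value. Assumption: values are UTF-8 text.
     */
    static Map<String, String> headersOf(Map<String, byte[]> xattrs) {
        Map<String, String> headers = new TreeMap<>();
        for (Map.Entry<String, byte[]> e : xattrs.entrySet()) {
            String name = e.getKey();
            if (name.startsWith(XA_HEADER_PREFIX) && e.getValue() != null) {
                headers.put(name.substring(XA_HEADER_PREFIX.length()),
                        new String(e.getValue(), StandardCharsets.UTF_8));
            }
        }
        return headers;
    }
}
```

With a real filesystem instance, the attribute name built by `xattrName` would be passed to `FileSystem.getXAttr(Path, String)`, which returns the value as a `byte[]`.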

      The magic commit file write sets the custom header
      x-hadoop-s3a-magic-data-length in the marker file to the
      length of the final data.

      A matching patch in Spark will look for the XAttr
      "header.x-hadoop-s3a-magic-data-length" when the file
      being probed for output data is zero bytes long.
      As a result, the job tracking statistics will report the
      bytes written but not yet manifest.
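The fallback described above can be sketched as follows. This is not the actual Spark patch: the class `MagicLengthProbe` and its method are hypothetical, and encoding the header value as a UTF-8 decimal string is an assumption. The caller is assumed to have already obtained the length from `FileStatus.getLen()` and the raw attribute from `FileSystem.getXAttr(path, name)`.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not the Spark implementation.
final class MagicLengthProbe {
    /** XAttr name given in the issue description. */
    static final String XA_MAGIC_LENGTH = "header.x-hadoop-s3a-magic-data-length";

    /**
     * Choose the byte count to report for one output file.
     *
     * @param statusLength length reported by getFileStatus
     * @param rawXAttr     value of XA_MAGIC_LENGTH, or null if absent
     */
    static long bytesWritten(long statusLength, byte[] rawXAttr) {
        if (statusLength > 0) {
            return statusLength;   // ordinary file: trust the file status
        }
        if (rawXAttr == null || rawXAttr.length == 0) {
            return 0L;             // zero-byte file with no marker header
        }
        try {
            // Assumption: the header holds the length as decimal text.
            long declared = Long.parseLong(
                    new String(rawXAttr, StandardCharsets.UTF_8).trim());
            return Math.max(declared, 0L);
        } catch (NumberFormatException e) {
            return 0L;             // unparseable header: fall back to zero
        }
    }
}
```

Probing the XAttr only when the status length is zero keeps the extra request off the path for filesystems and committers that already report real lengths.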

              People

              • Assignee:
                stevel@apache.org Steve Loughran
                Reporter:
                stevel@apache.org Steve Loughran
              • Votes:
                0
                Watchers:
                1

                  Time Tracking

                  Estimated:
                  Original Estimate - Not Specified
                  Remaining:
                  Remaining Estimate - 0h
                  Logged:
                  Time Spent - 12h 20m