  1. Hadoop Common
  2. HADOOP-19353 Über-jira: S3A Hadoop 3.4.2 features
  3. HADOOP-14943

Add common getFileBlockLocations() emulation for object stores, including S3A

Details

    • Type: Sub-task
    • Status: Patch Available
    • Priority: Minor
    • Resolution: Unresolved
    • Affects Version/s: 2.8.1
    • Fix Version/s: None
    • Component/s: fs/s3
    • Labels: None

    Description

      It looks suspiciously like S3A isn't providing the partitioning data in listLocatedStatus() and getFileBlockLocations() that is needed to break up a file by the block size. This will stop tools using the MRv1 APIs from doing the partitioning properly if the input format isn't doing its own split logic.

      FileInputFormat in MRv2 is a bit more configurable about input split calculation and will split up large files. Otherwise, the partitioning is driven more by the default values of the executing engine than by any config data from the filesystem about what its "block size" is.

      NativeAzureFS does a better job; maybe that could be factored out to hadoop-common and reused?
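
      To make the idea concrete, here is a minimal sketch of how an object store could fake block locations by cutting a file into block-size ranges. It is not taken from the attached patches or from NativeAzureFS; the class BlockLocationEmulation, the FakeBlock type, the emulate() method and the "localhost" placeholder host are all hypothetical names chosen for illustration, and the code deliberately avoids the Hadoop classes so that it runs on its own.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative only: emulates block locations for a store that has no real blocks. */
final class BlockLocationEmulation {

    /** Hypothetical stand-in for a block location: one byte range on a placeholder host. */
    static final class FakeBlock {
        final long offset;
        final long length;
        final String host;

        FakeBlock(long offset, long length, String host) {
            this.offset = offset;
            this.length = length;
            this.host = host;
        }
    }

    /**
     * Returns the emulated blocks that overlap the range [start, start + len).
     * Blocks are aligned to multiples of blockSize; the last one may be shorter.
     */
    static List<FakeBlock> emulate(long fileLength, long blockSize, long start, long len) {
        if (start < 0 || len < 0) {
            throw new IllegalArgumentException("start and len must be non-negative");
        }
        if (blockSize <= 0) {
            throw new IllegalArgumentException("blockSize must be positive");
        }
        List<FakeBlock> blocks = new ArrayList<>();
        if (start >= fileLength || len == 0) {
            return blocks; // nothing in range
        }
        // Clamp the end of the range to the file length without overflowing.
        long end = (len > fileLength - start) ? fileLength : start + len;
        // Begin at the block boundary at or before 'start'.
        long offset = (start / blockSize) * blockSize;
        while (offset < end) {
            long length = Math.min(blockSize, fileLength - offset);
            // Assumption: an object store has no data locality, so every block
            // reports the same placeholder host.
            blocks.add(new FakeBlock(offset, length, "localhost"));
            offset += blockSize;
        }
        return blocks;
    }
}
```

      With a block size taken from the filesystem's configuration, a split calculator that relies on block locations would then see several ranges for a large object instead of one range covering the whole file.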

      Attachments

        1. HADOOP-14943-004.patch
          19 kB
          Steve Loughran
        2. HADOOP-14943-003.patch
          19 kB
          Steve Loughran
        3. HADOOP-14943-002.patch
          20 kB
          Steve Loughran
        4. HADOOP-14943-002.patch
          20 kB
          Steve Loughran
        5. HADOOP-14943-001.patch
          1 kB
          Steve Loughran

            People

              Assignee: Steve Loughran (stevel@apache.org)
              Reporter: Steve Loughran (stevel@apache.org)
              Votes: 0
              Watchers: 11
