Details
Type: Sub-task
Status: Patch Available
Priority: Minor
Resolution: Unresolved
Affects Version/s: 2.8.1
Fix Version/s: None
Component/s: None
Description
It looks suspiciously like S3A isn't providing the partitioning data in listLocatedStatus() and getFileBlockLocations() needed to break a file up by block size. This will stop tools using the MRv1 APIs from partitioning properly unless the input format implements its own split logic.
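A quick way to see what the filesystem is reporting: a minimal sketch (the bucket and object names below are hypothetical) that prints the block size and block locations S3A returns for a file. If the block size is meaningless, or a single location covers the whole file, split calculation has nothing to work with.
{code:java}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class S3ABlockSizeProbe {
  public static void main(String[] args) throws Exception {
    // Hypothetical bucket/object, purely for illustration.
    Path path = new Path("s3a://example-bucket/data/large-file.csv");
    FileSystem fs = path.getFileSystem(new Configuration());

    FileStatus status = fs.getFileStatus(path);
    System.out.println("reported block size = " + status.getBlockSize());

    // The per-block offsets/lengths here are what MRv1-style clients rely
    // on to break the file up; S3A should be synthesizing them from the
    // configured block size.
    BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
    for (BlockLocation block : blocks) {
      System.out.println("offset=" + block.getOffset()
          + " length=" + block.getLength());
    }
  }
}
{code}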
FileInputFormat in MRv2 is a bit more configurable about input split calculation and will split up large files, but otherwise the partitioning is driven by the default values of the executing engine rather than by any config data from the filesystem about what its "block size" is.
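For reference, the MRv2 calculation (org.apache.hadoop.mapreduce.lib.input.FileInputFormat#computeSplitSize) boils down to clamping the filesystem-reported block size between the configured minimum and maximum split sizes, roughly:
{code:java}
public class SplitSizeDemo {
  // Paraphrase of FileInputFormat#computeSplitSize: the block size the
  // filesystem reports is the default split size, clamped by the
  // configured min/max split sizes.
  static long computeSplitSize(long blockSize, long minSize, long maxSize) {
    return Math.max(minSize, Math.min(maxSize, blockSize));
  }

  public static void main(String[] args) {
    long minSize = 1L;             // effective default of mapreduce.input.fileinputformat.split.minsize
    long maxSize = Long.MAX_VALUE; // default of mapreduce.input.fileinputformat.split.maxsize
    // A sane 128 MB block size gives 128 MB splits.
    System.out.println(computeSplitSize(128L << 20, minSize, maxSize));
    // A block size of 0 from the filesystem degenerates to minSize (1 byte),
    // leaving partitioning governed by whatever the engine configures.
    System.out.println(computeSplitSize(0L, minSize, maxSize));
  }
}
{code}
So unless the engine overrides the min/max split sizes itself, the filesystem's reported block size is what decides the partitioning.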
NativeAzureFS does a better job; maybe that could be factored out to hadoop-common and reused?
Attachments
Issue Links
- breaks
  - SPARK-22240 S3 CSV number of partitions incorrectly computed (Resolved)
- contains
  - HADOOP-15044 Wasb getFileBlockLocations() returns too many locations. (Resolved)
- is depended upon by
  - HADOOP-15132 Über-jira: WASB client phase III: roll-up for Hadoop 3.2 (Open)
- is related to
  - HADOOP-12878 Impersonate hosts in s3a for better data locality handling (Open)
  - HADOOP-15000 s3a new getdefaultblocksize be called in getFileStatus which has not been implemented in s3afilesystem yet (Open)
  - HADOOP-15320 Remove customized getFileBlockLocations for hadoop-azure and hadoop-azure-datalake (Resolved)
- relates to
  - HDFS-12831 HDFS throws FileNotFoundException on getFileBlockLocations(path-to-directory) (Patch Available)