[HADOOP-15320] Remove customized getFileBlockLocations for hadoop-azure and hadoop-azure-datalake - ASF JIRA

Details

Type: Bug
Status: Resolved
Priority: Major
Resolution: Fixed
Affects Version/s: 2.7.3, 2.9.0, 3.0.0
Fix Version/s: 3.1.0, 2.9.1
Component/s: fs/adl, fs/azure
Labels: None
Hadoop Flags: Reviewed

Description

hadoop-azure and hadoop-azure-datalake each have their own implementation of getFileBlockLocations(), which fabricates a list of artificial blocks based on a hard-coded block size. Each block has a single host named "localhost". See this code:

https://github.com/apache/hadoop/blob/release-2.9.0-RC3/hadoop-tools/hadoop-azure/src/main/java/org/apache/hadoop/fs/azure/NativeAzureFileSystem.java#L3485

This is an unnecessary mock-up that makes a "remote" file system mimic HDFS. The problem with it is that for large (~TB) files we generate a large number of artificial blocks, and FileInputFormat.getSplits() is slow to calculate splits based on these blocks.

We can safely remove this customized getFileBlockLocations() implementation and fall back to the default FileSystem.getFileBlockLocations() implementation, which returns a single block for any file, with one host, "localhost". Note that this does not mean we will create far fewer splits, because the split size is still capped by the blockSize in FileInputFormat.computeSplitSize():

return Math.max(minSize, Math.min(goalSize, blockSize));
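
To make the effect of that expression concrete, here is a minimal standalone sketch. It re-implements the quoted formula rather than calling Hadoop, and the class name, the 512 MiB block size, the 1 TiB file size, and the simplified ceiling-division split count are all assumptions chosen for illustration; the real getSplits() logic has additional details that are not modeled here.

```java
public class SplitSizeSketch {

    // Same expression as the one quoted from FileInputFormat.computeSplitSize().
    static long computeSplitSize(long goalSize, long minSize, long blockSize) {
        return Math.max(minSize, Math.min(goalSize, blockSize));
    }

    // Simplified estimate (assumption): ceiling division of file size by split size.
    static long estimateSplits(long fileSize, long splitSize) {
        return (fileSize + splitSize - 1) / splitSize;
    }

    public static void main(String[] args) {
        long fileSize = 1L << 40;     // hypothetical 1 TiB file
        long blockSize = 512L << 20;  // assumed 512 MiB reported block size
        long minSize = 1L;            // assumed minimum split size
        long goalSize = fileSize;     // assumed goal: the whole file in one split

        long splitSize = computeSplitSize(goalSize, minSize, blockSize);
        long splits = estimateSplits(fileSize, splitSize);

        // Even with a goal of one split, the block size caps the split size.
        System.out.println("splitSize=" + splitSize + " splits=" + splits);
    }
}
```

Under these assumed values the split size is capped at the block size, so the number of splits stays large regardless of how many block locations getFileBlockLocations() returns.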

Attachments

HADOOP-15320.01.patch (28/Mar/18 00:32, 11 kB, Christopher Douglas)
HADOOP-15320.patch (16/Mar/18 22:29, 10 kB, shanyu zhao)

Issue Links

relates to

HADOOP-14943 Add common getFileBlockLocations() emulation for object stores, including S3A

Patch Available

People

Assignee: shanyu zhao

Reporter: shanyu zhao

Votes: 1

Watchers: 5

Dates

Created: 16/Mar/18 22:20

Updated: 28/Mar/18 19:28

Resolved: 28/Mar/18 19:05