Flink / FLINK-19221

Exploit LocatableFileStatus from Hadoop


Details

    Description

      When the HDFS client returns a FileStatus (the description of a file), it sometimes returns a LocatedFileStatus, which already contains all the BlockLocation information.

      We should expose this on the Flink side, because it may save us a lot of RPC calls to the NameNode. The file enumerators often request block locations for all files, currently doing one RPC call per file.

      When the FileStatus obtained from listing the directory (or getting details for a file) already has all the block locations, we can save the extra RPC call per file.
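      For reference, a rough sketch of the Hadoop-side behaviour described above (illustration only, plain Hadoop API, not part of the proposed change; the directory path is just a placeholder argument): listLocatedStatus() already ships the block locations together with the listing entries, while the listStatus() + getFileBlockLocations() combination pays one extra NameNode RPC per file.

      {code:java}
      import java.io.IOException;

      import org.apache.hadoop.conf.Configuration;
      import org.apache.hadoop.fs.BlockLocation;
      import org.apache.hadoop.fs.FileStatus;
      import org.apache.hadoop.fs.FileSystem;
      import org.apache.hadoop.fs.LocatedFileStatus;
      import org.apache.hadoop.fs.Path;
      import org.apache.hadoop.fs.RemoteIterator;

      public class ListingRpcComparison {

          public static void main(String[] args) throws IOException {
              final Path dir = new Path(args[0]);
              final FileSystem fs = dir.getFileSystem(new Configuration());

              // Path A: plain listing, then one extra NameNode RPC per file for the locations.
              for (FileStatus status : fs.listStatus(dir)) {
                  BlockLocation[] locations = fs.getFileBlockLocations(status, 0, status.getLen());
                  System.out.println(status.getPath() + " -> " + locations.length + " blocks");
              }

              // Path B: located listing, block locations come with each LocatedFileStatus for free.
              RemoteIterator<LocatedFileStatus> files = fs.listLocatedStatus(dir);
              while (files.hasNext()) {
                  LocatedFileStatus status = files.next();
                  System.out.println(status.getPath() + " -> " + status.getBlockLocations().length + " blocks");
              }
          }
      }
      {code}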

      The suggested implementation is as follows:

      1. We introduce a LocatedFileStatus in Flink and integrate it with the built-in LocalFileSystem.
      2. We integrate this with the HadoopFileSystems by creating a Flink LocatedFileStatus whenever the underlying file system returned a Hadoop LocatedFileStatus.
      3. As a safety net, the FS methods that access block information check whether the presented file status already contains the block information and return that information directly.

      Steps one and two simplify things for FileSystem users (no need to ask for the extra info if it is already available).
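      A rough sketch of what steps one and two could look like is below. The names LocatedFileStatus and LocatedHadoopFileStatus are placeholders, not a decided API, and the existing HadoopFileStatus / HadoopBlockLocation wrappers are assumed to be reusable as-is; the LocalFileSystem integration would implement the same interface for its own file statuses.

      {code:java}
      import java.util.Arrays;

      import org.apache.flink.core.fs.BlockLocation;
      import org.apache.flink.core.fs.FileStatus;
      import org.apache.flink.runtime.fs.hdfs.HadoopBlockLocation;
      import org.apache.flink.runtime.fs.hdfs.HadoopFileStatus;

      /** Step 1: a Flink FileStatus that already knows where its blocks live (placeholder name). */
      interface LocatedFileStatus extends FileStatus {
          BlockLocation[] getBlockLocations();
      }

      /** Step 2: created by the Hadoop wrapper whenever the listing returned a Hadoop LocatedFileStatus. */
      class LocatedHadoopFileStatus extends HadoopFileStatus implements LocatedFileStatus {

          private final org.apache.hadoop.fs.LocatedFileStatus hadoopStatus;

          LocatedHadoopFileStatus(org.apache.hadoop.fs.LocatedFileStatus hadoopStatus) {
              super(hadoopStatus);
              this.hadoopStatus = hadoopStatus;
          }

          @Override
          public BlockLocation[] getBlockLocations() {
              // Pure conversion of what the listing already shipped - no call to the NameNode.
              return Arrays.stream(hadoopStatus.getBlockLocations())
                      .map(HadoopBlockLocation::new)
                      .toArray(BlockLocation[]::new);
          }
      }
      {code}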

      Step three is the transparent shortcut that all applications get, even if they do not explicitly use the LocatedFileStatus and keep calling FileSystem.getFileBlockLocations().
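      A sketch of the step-three safety net, written as a hypothetical helper on top of the existing FileSystem#getFileBlockLocations(FileStatus, long, long); LocatedFileStatus refers to the interface sketched above:

      {code:java}
      import java.io.IOException;

      import org.apache.flink.core.fs.BlockLocation;
      import org.apache.flink.core.fs.FileStatus;
      import org.apache.flink.core.fs.FileSystem;

      final class BlockLocationShortcut {

          private BlockLocationShortcut() {}

          static BlockLocation[] blockLocationsFor(FileSystem fs, FileStatus file) throws IOException {
              if (file instanceof LocatedFileStatus) {
                  // The listing already delivered the block locations - no extra RPC needed.
                  return ((LocatedFileStatus) file).getBlockLocations();
              }
              // Transparent fallback for callers that keep passing a plain FileStatus.
              return fs.getFileBlockLocations(file, 0, file.getLen());
          }
      }
      {code}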


People

    Assignee: sewen (Stephan Ewen)
    Reporter: sewen (Stephan Ewen)
    Votes: 0
    Watchers: 1

Dates

    Created:
    Updated:
    Resolved: