[HDFS-8182] Implement topology-aware CDN-style caching - ASF JIRA

Details

Type: Improvement
Status: Open
Priority: Major
Resolution: Unresolved
Affects Version/s: 2.6.0
Fix Version/s: None
Component/s: hdfs-client, namenode
Labels: None

Description

To scale reads of hot blocks in large clusters, it would be beneficial if each block had to cross the ToR (top-of-rack) switches only once per rack. Example scenarios include localization of binaries and MR distributed cache files for map-side joins. This could be implemented at several layers (a YARN service, or individual apps such as MR), but I believe it is best done in HDFS, or even in the common FileSystem, to support as many use cases as possible.

The life cycle could look like this, e.g. for the YARN localization scenario:
1. inputStream = fs.open(path, ..., CACHE_IN_RACK)
2. Instead of letting the client read from a remote DN directly, the NN tells the client to read via the local DN1, and DN1 creates a replica of each block.

When the next localizer starts on DN2 in the same rack, it will learn from the NN about the replica on DN1, and the client will read from DN1 using the conventional path.

When the application ends, the AM or the NMs can notify the NN in an fadvise DONTNEED style, and the NN can then start telling the DNs to discard the extraneous replicas.
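
To make the life cycle above concrete, here is a minimal sketch of the NN-side bookkeeping it implies, written as a toy in-memory model rather than against real HDFS code. Everything in it is an assumption for illustration: the class RackCacheSketch, the methods openCacheInRack and dontNeed, and the ReadPlan type are hypothetical names and do not exist in HDFS. It only models which DN serves a read and when a block crosses racks.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Toy model of the proposed NN bookkeeping; every name here is hypothetical. */
class RackCacheSketch {
    /** What the NN tells the client to do for one block read. */
    static final class ReadPlan {
        final String serveFrom; // DN the client reads from
        final String pullFrom;  // remote DN that serveFrom copies the block from, or null
        ReadPlan(String serveFrom, String pullFrom) {
            this.serveFrom = serveFrom;
            this.pullFrom = pullFrom;
        }
        boolean crossesRack() { return pullFrom != null; }
    }

    private final Map<String, String> rackOfDn = new HashMap<>();
    private final Map<String, Set<String>> permanent = new HashMap<>(); // block -> DNs
    private final Map<String, Set<String>> cached = new HashMap<>();    // block -> DNs

    void addDatanode(String dn, String rack) { rackOfDn.put(dn, rack); }

    void addPermanentReplica(String block, String dn) {
        permanent.computeIfAbsent(block, b -> new TreeSet<>()).add(dn);
    }

    /** Models open(path, ..., CACHE_IN_RACK) from a client co-located with localDn. */
    ReadPlan openCacheInRack(String block, String localDn) {
        String rack = rackOfDn.get(localDn);
        if (rack == null) throw new IllegalArgumentException("unknown DN " + localDn);
        List<String> candidates = new ArrayList<>();
        candidates.addAll(permanent.getOrDefault(block, new TreeSet<>()));
        candidates.addAll(cached.getOrDefault(block, new TreeSet<>()));
        if (candidates.isEmpty()) throw new IllegalStateException("no replica of " + block);
        // Conventional path: some replica (permanent or cached) is already in the rack.
        for (String dn : candidates) {
            if (rack.equals(rackOfDn.get(dn))) return new ReadPlan(dn, null);
        }
        // No in-rack copy: the local DN pulls the block once and keeps an extra replica.
        cached.computeIfAbsent(block, b -> new TreeSet<>()).add(localDn);
        return new ReadPlan(localDn, candidates.get(0));
    }

    /** Models the DONTNEED-style hint; returns the DNs told to discard their extra replica. */
    Set<String> dontNeed(String block) {
        Set<String> removed = cached.remove(block);
        return removed == null ? new TreeSet<>() : removed;
    }
}
```

With DN1 and DN2 in one rack and the only permanent replica on a DN in another rack, the first openCacheInRack call from DN1 yields a plan that crosses the rack boundary and registers DN1 as holding an extra replica. The same call from DN2 is then served from DN1 without crossing racks, and after dontNeed the next open has to cross the rack again. A real design would also have to handle replica choice among several remote DNs, failures during the pull, and eviction under space pressure, none of which this sketch attempts.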

Issue Links

is related to: YARN-5396 YARN large file broadcast service (Open)

People

Assignee: Unassigned

Reporter: Gera Shegalov

Votes: 1

Watchers: 13

Dates

Created: 20/Apr/15 03:11

Updated: 02/Aug/16 18:05