[HDFS-7314] When the DFSClient lease cannot be renewed, abort open-for-write files rather than the entire DFSClient - ASF JIRA

Details

Type: Improvement
Status: Closed
Priority: Major
Resolution: Fixed
Affects Version/s: None
Fix Version/s: 2.6.1, 2.8.0, 2.7.2, 3.0.0-alpha1
Component/s: None
Labels:

Description

This happened in a YARN NodeManager scenario, but it could happen to any long-running service that uses a cached instance of DistributedFileSystem.

1. The active NN was under heavy load and became unavailable for 10 minutes; any DFSClient request during that time got ConnectTimeoutException.

2. The YARN NodeManager uses DFSClient for certain write operations, such as log aggregation or the shared cache in YARN-1492. The renewLease RPC of the DFSClient used by the YARN NM got ConnectTimeoutException.

2014-10-29 01:36:19,559 WARN org.apache.hadoop.hdfs.LeaseRenewer: Failed to renew lease for [DFSClient_NONMAPREDUCE_-550838118_1] for 372 seconds.  Aborting ...

3. Once the DFSClient is in the Aborted state, the YARN NM can no longer use that cached instance of DistributedFileSystem.

2014-10-29 20:26:23,991 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService: Failed to download rsrc...
java.io.IOException: Filesystem closed
        at org.apache.hadoop.hdfs.DFSClient.checkOpen(DFSClient.java:727)
        at org.apache.hadoop.hdfs.DFSClient.getFileInfo(DFSClient.java:1780)
        at org.apache.hadoop.hdfs.DistributedFileSystem$17.doCall(DistributedFileSystem.java:1124)
        at org.apache.hadoop.hdfs.DistributedFileSystem$17.doCall(DistributedFileSystem.java:1120)
        at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
        at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:1120)
        at org.apache.hadoop.yarn.util.FSDownload.copy(FSDownload.java:237)
        at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:340)
        at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:57)
        at java.util.concurrent.FutureTask.run(FutureTask.java:262)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
        at java.util.concurrent.FutureTask.run(FutureTask.java:262)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:745)
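
The failure mode in steps 1 to 3 can be pictured with a small stand-alone model. This is not Hadoop code: `ToyClient` and its methods are hypothetical stand-ins that only mimic the behavior visible in the log and stack trace above, where a whole-client abort makes even a read-only metadata call fail with "Filesystem closed".

```java
import java.io.IOException;

/** Toy model (not Hadoop code) of a cached client that is aborted as a whole. */
class AbortedClientDemo {

    /** Hypothetical stand-in for a DFSClient-like object. */
    static class ToyClient {
        private volatile boolean running = true;

        /** Models the lease renewer giving up and aborting the entire client. */
        void abort() {
            running = false;
        }

        /** Models the checkOpen() frame seen in the stack trace. */
        void checkOpen() throws IOException {
            if (!running) {
                throw new IOException("Filesystem closed");
            }
        }

        /** A read-only call: it needs no lease, yet it fails once the client is aborted. */
        String getFileInfo(String path) throws IOException {
            checkOpen();
            return "status:" + path;
        }
    }

    public static void main(String[] args) {
        ToyClient cached = new ToyClient(); // long-lived instance shared by the service
        cached.abort();                     // lease renewal failed for too long
        try {
            cached.getFileInfo("/some/path");
        } catch (IOException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

Because the service keeps holding the same instance, every later call hits the same check until the process is restarted or the instance is replaced.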

We can make YARN or DFSClient more tolerant of temporary NN unavailability. Given that the call stack is YARN -> DistributedFileSystem -> DFSClient, this can be addressed at different layers.

YARN closes the DistributedFileSystem object when it receives some well-defined exception, so that the next HDFS call creates a new instance of DistributedFileSystem. We would have to fix all the places in YARN, and other HDFS applications would need to address this as well.
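
A minimal sketch of this application-level approach, written against toy types rather than the Hadoop API: `ToyFs`, `RecoveringFs`, and the factory are hypothetical, and matching on the exception message is an assumption made only for illustration. The wrapper drops the unusable instance when it sees "Filesystem closed" and retries once on a fresh one.

```java
import java.io.IOException;
import java.util.function.Supplier;

/** Toy model (not Hadoop code) of an application replacing an aborted file system. */
class RecreateOnClosedDemo {

    /** Hypothetical stand-in for the one file system call the application makes. */
    interface ToyFs {
        String getFileStatus(String path) throws IOException;
    }

    /** Wraps a cached instance and replaces it when it reports that it is closed. */
    static class RecoveringFs {
        private final Supplier<ToyFs> factory;
        private ToyFs current;

        RecoveringFs(Supplier<ToyFs> factory) {
            this.factory = factory;
            this.current = factory.get();
        }

        synchronized String getFileStatus(String path) throws IOException {
            try {
                return current.getFileStatus(path);
            } catch (IOException e) {
                // Assumption: the message text identifies the closed state.
                if (!"Filesystem closed".equals(e.getMessage())) {
                    throw e; // unrelated failure, propagate unchanged
                }
                current = factory.get();            // discard the aborted instance
                return current.getFileStatus(path); // retry exactly once
            }
        }
    }
}
```

The sketch also shows the cost named above: every call site needs to go through such a wrapper, in YARN and in every other application.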

DistributedFileSystem detects an Aborted DFSClient and creates a new instance of DFSClient. We would need to fix all the places where DistributedFileSystem calls DFSClient.

After DFSClient gets into the Aborted state, it doesn't have to reject all requests; instead, it can retry. If the NN is available again, it can transition back to a healthy state.
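
The issue title points at a narrower variant of this last idea: when the lease cannot be renewed, abort only the files open for write and leave the client itself usable. The following is a rough sketch of that behavior under stated assumptions, not the attached patch; `ToyClient`, `ToyOutputStream`, and `abortFilesBeingWritten` are hypothetical names.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

/** Toy model (not Hadoop code): abort open-for-write files, keep the client alive. */
class AbortOpenFilesDemo {

    /** Hypothetical stand-in for an output stream that depends on the lease. */
    static class ToyOutputStream {
        private boolean aborted;

        synchronized void abort() {
            aborted = true;
        }

        synchronized void write(int b) throws IOException {
            if (aborted) {
                throw new IOException("stream aborted: lease could not be renewed");
            }
            // A real stream would buffer and send the byte here.
        }
    }

    /** Hypothetical stand-in for the client that tracks files open for write. */
    static class ToyClient {
        private final List<ToyOutputStream> filesBeingWritten = new ArrayList<>();

        synchronized ToyOutputStream create(String path) {
            ToyOutputStream out = new ToyOutputStream();
            filesBeingWritten.add(out);
            return out;
        }

        /** Called when lease renewal has failed for too long. */
        synchronized void abortFilesBeingWritten() {
            for (ToyOutputStream out : filesBeingWritten) {
                out.abort();
            }
            filesBeingWritten.clear(); // nothing left that needs a lease
        }

        synchronized int openFileCount() {
            return filesBeingWritten.size();
        }

        /** Read-only call: it does not depend on the lease, so it keeps working. */
        String getFileInfo(String path) {
            return "status:" + path;
        }
    }
}
```

In this model, writers that held the lease see an error on their next write, while callers such as resource localization, which only read metadata, continue to work once the NN is reachable again.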

Comments?

Attachments

- HDFS-7314-branch-2.7.2.txt (10/Sep/15 19:58, 7 kB, Vinod Kumar Vavilapalli)
- HDFS-7314-9.patch (14/Jul/15 03:39, 6 kB, Ming Ma)
- HDFS-7314-8.patch (08/Jul/15 20:39, 6 kB, Ming Ma)
- HDFS-7314-7.patch (11/Nov/14 02:55, 8 kB, Ming Ma)
- HDFS-7314-6.patch (10/Nov/14 23:22, 7 kB, Ming Ma)
- HDFS-7314-5.patch (08/Nov/14 03:04, 7 kB, Ming Ma)
- HDFS-7314-4.patch (07/Nov/14 04:38, 9 kB, Ming Ma)
- HDFS-7314-3.patch (06/Nov/14 20:58, 6 kB, Ming Ma)
- HDFS-7314-2.patch (04/Nov/14 19:54, 6 kB, Ming Ma)
- HDFS-7314.patch (04/Nov/14 06:52, 9 kB, Ming Ma)

People

Assignee: Ming Ma

Reporter: Ming Ma

Votes: 0

Watchers: 17

Dates

Created: 31/Oct/14 03:53

Updated: 06/Jan/17 01:19

Resolved: 16/Jul/15 20:01