Details
Type: Bug
Status: Resolved
Priority: Major
Resolution: Fixed
Hadoop Flags: Reviewed
Release Note: Two new configuration properties, "dfs.namenode.lease-recheck-interval-ms" and "dfs.namenode.max-lock-hold-to-release-lease-ms", have been added to fine-tune the duty cycle with which the NameNode recovers old leases.
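As a rough illustration of how these properties would be used, the snippet below sets them in hdfs-site.xml. The values shown are illustrative only (they are not claimed to be the shipped defaults), and the descriptions simply paraphrase the property names.

<property>
  <name>dfs.namenode.lease-recheck-interval-ms</name>
  <value>2000</value>
  <description>How often the lease monitor checks for leases to release.</description>
</property>
<property>
  <name>dfs.namenode.max-lock-hold-to-release-lease-ms</name>
  <value>25</value>
  <description>Upper bound on how long the monitor may hold the namesystem lock in one pass while releasing leases.</description>
</property>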
Description
I have faced a NameNode failover due to an unresponsive NameNode detected by the ZKFC, with lots of WARN messages (about 5 million) like this one:
org.apache.hadoop.hdfs.StateChange: BLOCK* internalReleaseLease: All existing blocks are COMPLETE, lease removed, file closed.
In the thread dump taken by the ZKFC there are lots of threads blocked waiting on a lock.
Looking at the code, a lock is taken by the LeaseManager.Monitor when leases must be released. Because of the very large number of leases to release, the NameNode took too long to release them, blocking all other tasks and making the ZKFC think the NameNode was unavailable/stuck.
The idea of this patch is to limit the number of leases released each time we check for leases, so that the lock is not held for too long a period.
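Below is a minimal, self-contained sketch of that throttling idea, written against simplified stand-ins rather than the real NameNode internals: the monitor releases expired leases while holding the lock only until a small time budget is exceeded, then gives the lock back and handles the remaining leases on the next recheck. The Lease class, the queue, the lock field, and the two constants are hypothetical and exist only for this example.

import java.util.ArrayDeque;
import java.util.Queue;
import java.util.concurrent.locks.ReentrantLock;

public class LeaseReleaseSketch {
    // Hypothetical stand-ins for the two new configuration values.
    static final long MAX_LOCK_HOLD_TO_RELEASE_LEASE_MS = 25;
    static final long LEASE_RECHECK_INTERVAL_MS = 2_000;

    // Simplified lease: just the name of the client holding it.
    static class Lease {
        final String holder;
        Lease(String holder) { this.holder = holder; }
    }

    private final ReentrantLock fsLock = new ReentrantLock();        // stand-in for the namesystem lock
    private final Queue<Lease> expiredLeases = new ArrayDeque<>();   // leases due for release

    // One pass of the monitor: do a bounded amount of work under the lock, then sleep.
    void monitorOnce() throws InterruptedException {
        fsLock.lock();
        try {
            long start = System.currentTimeMillis();
            while (!expiredLeases.isEmpty()) {
                release(expiredLeases.poll());
                // Stop early once the time budget is spent so other operations
                // waiting on the lock are not blocked for a long stretch;
                // leftover leases are picked up on the next recheck.
                if (System.currentTimeMillis() - start > MAX_LOCK_HOLD_TO_RELEASE_LEASE_MS) {
                    break;
                }
            }
        } finally {
            fsLock.unlock();
        }
        Thread.sleep(LEASE_RECHECK_INTERVAL_MS);
    }

    // Placeholder for the real lease release (internalReleaseLease in HDFS).
    private void release(Lease lease) {
        System.out.println("released lease held by " + lease.holder);
    }
}

The point of the sketch is that the lock hold time per pass is bounded no matter how many leases are queued, which is what keeps the ZKFC health check from timing out while old leases are being recovered.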
Attachments
Issue Links
- is related to: HDFS-6757 Simplify lease manager with INodeID (Resolved)
- relates to: HDFS-13977 NameNode can kill itself if it tries to send too many txns to a QJM simultaneously (Resolved)