Hadoop HDFS / HDFS-13977

NameNode can kill itself if it tries to send too many txns to a QJM simultaneously

Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 2.7.7
    • Fix Version/s: 2.10.0, 3.3.0, 3.2.1, 3.1.3
    • Component/s: namenode, qjm
    • Labels: None

    Description

      Problem & Logs

      We recently encountered an issue on a large cluster (running 2.7.4) in which the NameNode killed itself because it was unable to communicate with the JNs via QJM. We discovered that it was the result of the NameNode trying to send a huge batch of over 1 million transactions to the JNs in a single RPC:

      NameNode Logs
      WARN org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Remote journal X.X.X.X:XXXX failed to
       write txns 10000000-11153636. Will try to write to this JN again after the next log roll.
      ...
      WARN org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Took 1098ms to send a batch of 1153637 edits (335886611 bytes) to remote journal X.X.X.X:XXXX
      
      JournalNode Logs
      INFO org.apache.hadoop.ipc.Server: Socket Reader #1 for port 8485: readAndProcess from client X.X.X.X threw exception [java.io.IOException: Requested data length 335886776 is longer than maximum configured RPC length 67108864.  RPC came from X.X.X.X]
      java.io.IOException: Requested data length 335886776 is longer than maximum configured RPC length 67108864.  RPC came from X.X.X.X
              at org.apache.hadoop.ipc.Server$Connection.checkDataLength(Server.java:1610)
              at org.apache.hadoop.ipc.Server$Connection.readAndProcess(Server.java:1672)
              at org.apache.hadoop.ipc.Server$Listener.doRead(Server.java:897)
              at org.apache.hadoop.ipc.Server$Listener$Reader.doRunLoop(Server.java:753)
              at org.apache.hadoop.ipc.Server$Listener$Reader.run(Server.java:724)
      

      The JournalNodes rejected the RPC because its size, roughly 320 MB, was well over the 64 MB default ipc.maximum.data.length.
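
      For context, this is the generic Hadoop IPC length check rather than anything QJM-specific. A minimal sketch of the logic (illustrative only, not the actual Server.java source) shows why a roughly 320 MB batch is rejected outright:

      import java.io.IOException;

      // Illustrative sketch of the check performed in
      // org.apache.hadoop.ipc.Server$Connection#checkDataLength: the declared
      // RPC payload size is compared against ipc.maximum.data.length before
      // the payload is read.
      public class RpcLengthCheckSketch {
        // Default value of ipc.maximum.data.length: 64 MB.
        static final int DEFAULT_MAX_DATA_LENGTH = 64 * 1024 * 1024; // 67108864 bytes

        static void checkDataLength(int requestedLength, int maxDataLength)
            throws IOException {
          if (requestedLength > maxDataLength) {
            throw new IOException("Requested data length " + requestedLength
                + " is longer than maximum configured RPC length " + maxDataLength);
          }
        }

        public static void main(String[] args) throws IOException {
          // The batch from the logs above: 335886776 bytes, about 5x the limit.
          checkDataLength(335886776, DEFAULT_MAX_DATA_LENGTH);
        }
      }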

      This was triggered by a huge number of files all hitting a hard lease timeout simultaneously, causing the NN to force-close them all at once. This is a particularly nasty bug because the NN will attempt to re-send the same huge RPC on restart: the fsimage it loads still contains all of these open files, which again need to be force-closed.

      Proposed Solution

      To solve this, we propose modifying EditsDoubleBuffer to add a "hard limit" based on the value of ipc.maximum.data.length. When writeOp() or writeRaw() is called, first check the size of bufCurrent; if it exceeds the hard limit, block the writer until the buffer is flipped and bufCurrent becomes bufReady. This provides some self-throttling to prevent the NameNode from killing itself in this way, as sketched below.
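
      To make the idea concrete, here is a rough sketch of the throttling against a heavily simplified double buffer. The real org.apache.hadoop.hdfs.server.namenode.EditsDoubleBuffer uses its own buffer types and flush path, so everything here beyond the bufCurrent/bufReady naming is illustrative:

      import java.io.ByteArrayOutputStream;

      // Simplified sketch of the proposal. Assumes a single writer thread and
      // that any individual op is far smaller than the hard limit.
      public class ThrottledDoubleBufferSketch {
        private ByteArrayOutputStream bufCurrent = new ByteArrayOutputStream();
        private ByteArrayOutputStream bufReady = new ByteArrayOutputStream();
        // Derived from ipc.maximum.data.length so that a single flushed batch
        // can never exceed what the JournalNode's RPC server will accept.
        private final int hardLimit;

        public ThrottledDoubleBufferSketch(int ipcMaxDataLength) {
          this.hardLimit = ipcMaxDataLength;
        }

        // Analogous to writeOp()/writeRaw(): block the writer while appending
        // this op would push bufCurrent past the hard limit.
        public synchronized void write(byte[] op) throws InterruptedException {
          while (bufCurrent.size() + op.length > hardLimit) {
            wait(); // woken by setReadyToFlush() once the buffers flip
          }
          bufCurrent.write(op, 0, op.length);
        }

        // Analogous to setReadyToFlush(): flip the buffers so bufCurrent
        // becomes bufReady, then wake any writer blocked on the hard limit.
        public synchronized byte[] setReadyToFlush() {
          ByteArrayOutputStream tmp = bufReady;
          bufReady = bufCurrent;
          bufCurrent = tmp;
          bufCurrent.reset();
          notifyAll();
          return bufReady.toByteArray();
        }
      }

      Blocking the writer rather than failing keeps the edit-log semantics intact: the writer simply stalls until a full batch is drained, so no single RPC sent to the JNs can exceed the configured limit.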

      Attachments

        1. HDFS-13977.000.patch (8 kB, Erik Krogen)
        2. HDFS-13977.001.patch (8 kB, Erik Krogen)
        3. HDFS-13977.002.patch (10 kB, Erik Krogen)
        4. HDFS-13977.003.patch (10 kB, Erik Krogen)
        5. HDFS-13977-branch-2.003.patch (11 kB, Erik Krogen)

    People

      Assignee: xkrogen Erik Krogen
      Reporter: xkrogen Erik Krogen
      Votes: 0
      Watchers: 16
