[HADOOP-15684] triggerActiveLogRoll stuck on dead name node, when ConnectTimeoutException happens. - ASF JIRA

Details

Type: Bug
Status: Resolved
Priority: Critical
Resolution: Fixed
Affects Version/s: 3.0.0-alpha1
Fix Version/s: 3.2.0, 3.0.4, 3.1.2
Component/s: ha
Labels:
- multi-sbnn

Target Version/s: 3.2.0, 3.0.4, 3.1.2
Hadoop Flags: Reviewed
Release Note:
When NameNode A sends a RollEditLog request to a remote NN, and either the remote NN is in standby state or an IOException occurs, A should move on and try the next NN instead of getting stuck on the problematic one. This patch is based on trunk.
Flags: Patch

Description

When a NameNode calls triggerActiveLogRoll and the cachedActiveProxy points to a dead NameNode, the call throws a ConnectTimeoutException. The expected behavior is to try the next NN, but the current logic does not do so. Instead, it keeps retrying the dead NN, mistakenly treating it as the active one.

2018-08-17 10:02:12,001 WARN [Edit log tailer] org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer: Unable to trigger a roll of the active NN

org.apache.hadoop.net.ConnectTimeoutException: Call From SourceMachine001/SourceIP001 to TargetMachine001.ap.gbl:8020 failed on socket timeout exception: org.apache.hadoop.net.ConnectTimeoutException: 20000 millis timeout
org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$2.doWork(EditLogTailer.java:298)

C:\Users\rotang>ping TargetMachine001

Pinging TargetMachine001[TargetIP001] with 32 bytes of data:
Request timed out.
Request timed out.
Request timed out.
Request timed out.

The attachments are a log file showing how the standby repeatedly retries the dead NameNode, and a patch with the fix.

I replaced the actual machine names and IPs with SourceMachine001/SourceIP001 and TargetMachine001/TargetIP001.

How to Repro:

In a healthy cluster with multiple NNs, take down the active NN (do not let it come back during the test). The standby NNs will then keep trying the dead (old active) NN, because it is the cached one.
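The expected behavior described above can be illustrated with a minimal, self-contained sketch. This is not the attached patch and does not use the real EditLogTailer code. The class `RollEditLogFailoverSketch`, the interface `NameNodeProxy`, and the field `currentIndex` are hypothetical names invented for illustration. The only assumption is that a failed call (standby response or connect timeout) surfaces as an IOException, after which the caller should advance to the next NN rather than keep the cached one.

```java
import java.io.IOException;
import java.util.Arrays;
import java.util.List;

public class RollEditLogFailoverSketch {

    /** Hypothetical stand-in for an RPC proxy to one remote NameNode. */
    interface NameNodeProxy {
        void rollEditLog() throws IOException;
    }

    private final List<NameNodeProxy> proxies;
    // Index of the proxy currently cached as "active" (hypothetical field).
    private int currentIndex = 0;

    RollEditLogFailoverSketch(List<NameNodeProxy> proxies) {
        this.proxies = proxies;
    }

    /**
     * Tries each NN at most once, starting from the cached one.
     * Returns the index of the NN that accepted the roll, or -1 if none did.
     */
    int triggerActiveLogRoll() {
        for (int attempt = 0; attempt < proxies.size(); attempt++) {
            try {
                proxies.get(currentIndex).rollEditLog();
                return currentIndex; // success: keep this proxy cached
            } catch (IOException e) {
                // Remote NN is standby or unreachable: do not retry it,
                // move on to the next NN in the list.
                currentIndex = (currentIndex + 1) % proxies.size();
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        NameNodeProxy dead = () -> { throw new IOException("connect timeout"); };
        NameNodeProxy standby = () -> { throw new IOException("standby"); };
        NameNodeProxy active = () -> { };

        RollEditLogFailoverSketch tailer =
            new RollEditLogFailoverSketch(Arrays.asList(dead, standby, active));
        System.out.println(tailer.triggerActiveLogRoll()); // prints 2
        System.out.println(tailer.triggerActiveLogRoll()); // prints 2 (cached)
    }
}
```

In this sketch the dead NN is contacted once per failure rather than on every roll attempt, which is the behavior the report asks for. Timeouts, retry limits, and logging are left out on purpose.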

Attachments

- 0001-RollEditLog-try-next-NN-when-exception-happens.patch (20/Aug/18 21:56, 6 kB, Rong Tang)
- HADOOP-15684.000.patch (21/Aug/18 22:43, 6 kB, Rong Tang)
- HADOOP-15684.001.patch (22/Aug/18 22:05, 7 kB, Rong Tang)
- HADOOP-15684.002.patch (23/Aug/18 00:25, 11 kB, Rong Tang)
- HADOOP-15684.003.patch (31/Aug/18 20:32, 11 kB, Rong Tang)
- HADOOP-15684.004.patch (06/Sep/18 00:01, 11 kB, Rong Tang)
- HADOOP-15684.005.patch (19/Sep/18 19:55, 10 kB, Íñigo Goiri)
- hadoop--rollingUpgrade-SourceMachine001.log (20/Aug/18 23:16, 15 kB, Rong Tang)

Issue Links

- duplicates: HDFS-13900 NameNode: Unable to trigger a roll of the active NN (Resolved)
- is caused by: HDFS-6440 Support more than 2 NameNodes (Resolved)
- relates to: HDFS-14397 Backport HADOOP-15684 to branch-2 (Resolved)

People

Assignee: Rong Tang

Reporter: Rong Tang

Votes: 0

Watchers: 10

Dates

Created: 18/Aug/18 00:06

Updated: 02/Oct/19 17:13

Resolved: 19/Sep/18 20:01