[HDFS-6617] Flake TestDFSZKFailoverController.testManualFailoverWithDFSHAAdmin due to a long edit log sync op - ASF JIRA

XML

Word

Printable

JSON

Details

Type: Test
Status: Closed
Priority: Minor
Resolution: Fixed
Affects Version/s: 2.5.0, 3.0.0-alpha1
Fix Version/s: 2.6.0
Component/s: auto-failover, test
Labels:
None

Target Version/s:

2.6.0
Hadoop Flags:

Reviewed

Description

Just Hit a false alarm testing while working at ~~HDFS-6614~~, see https://builds.apache.org/job/PreCommit-HDFS-Build/7259//testReport/org.apache.hadoop.hdfs.server.namenode.ha/TestDFSZKFailoverController/testManualFailoverWithDFSHAAdmin/

After a looking at the log, shows the failure came from a timeout at
ZKFailoverController.doCedeActive():
localTarget.getProxy(conf, timeout).transitionToStandby(createReqInfo());

While stopping active service, see FSNamesystem.stopActiveServices():
void stopActiveServices() {
LOG.info("Stopping services started for active state");
....
this corelates with the log:
"2014-07-01 08:12:50,615 INFO namenode.FSNamesystem (FSNamesystem.java:stopActiveServices(1167)) - Stopping services started for active state"

then stopActiveServices will call editLog.close(), which goes to endCurrentLogSegment(), see log:
2014-07-01 08:12:50,616 INFO namenode.FSEditLog (FSEditLog.java:endCurrentLogSegment(1216)) - Ending log segment 1

but this operation did not finish in 5 seconds, then triggered the timeout:

2014-07-01 08:12:55,624 WARN ha.ZKFailoverController (ZKFailoverController.java:doCedeActive(577)) - Unable to transition local node to standby: Call From asf001.sp2.ygridcore.net/67.195.138.31 to localhost:10021 failed on socket timeout exception: java.net.SocketTimeoutException: 5000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/127.0.0.1:53965 remote=localhost/127.0.0.1:10021]; For more details see: http://wiki.apache.org/hadoop/SocketTimeout

the logEdit/logSync finally done followed with printStatistics(true):
2014-07-01 08:13:05,243 INFO namenode.FSEditLog (FSEditLog.java:printStatistics(675)) - Number of transactions: 2 Total time for transactions(ms): 0 Number of transactions batched in Syncs: 0 Number of syncs: 3 SyncTimes(ms): 14667 74 105

so obviously, this long sync contributed the timeout, maybe the QA box is very slow at that moment, so one possible fix here is setting the default fence timeout to a bigger one.

Attachments

- Sort By Name
- Sort By Date
- Ascending
- Descending

HDFS-6617.txt
01/Jul/14 10:45
0.9 kB
Liang Xie
HDFS-6617-v2.txt
03/Jul/14 10:02
1 kB
Liang Xie

Activity

People

Assignee:: Liang Xie

Reporter:: Liang Xie

Votes:: 0 Vote for this issue

Watchers:: 3 Start watching this issue

Dates

Created:: 01/Jul/14 10:40

Updated:: 12/May/16 18:12

Resolved:: 07/Jul/14 17:39