Hadoop HDFS / HDFS-6681 (sub-task of HDFS-15646: Track failing tests in HDFS)

TestRBWBlockInvalidation#testBlockInvalidationWhenRBWReplicaMissedInDN is flaky and sometimes gets stuck in infinite loops


Details

    • Type: Sub-task
    • Status: Patch Available
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 2.4.1
    • Fix Version/s: None
    • Component/s: None

    Description

      This test case contains three potentially infinite loops, each of which exits only when a certain condition is satisfied.
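
      For reference, the three waits have roughly the following shape (a paraphrase rather than the exact test source; countReplicas is assumed to be the test's helper around BlockManager#countNodes, and the loops run inside a test method declared with 'throws Exception'):

        import org.apache.hadoop.hdfs.protocol.ExtendedBlock;
        import org.apache.hadoop.hdfs.server.blockmanagement.NumberReplicas;
        import org.apache.hadoop.hdfs.server.namenode.FSNamesystem;

        // Assumed helper: ask the BlockManager how many live and corrupt
        // replicas of this block it currently knows about.
        private static NumberReplicas countReplicas(FSNamesystem ns, ExtendedBlock b) {
          return ns.getBlockManager().countNodes(b.getLocalBlock());
        }

        // Loop 1: wait until the corruption leaves exactly one live replica.
        while (countReplicas(namesystem, blk).liveReplicas() != 1) {
          Thread.sleep(100);
        }
        // Loop 2: wait until re-replication brings the count back to two.
        while (countReplicas(namesystem, blk).liveReplicas() != 2) {
          Thread.sleep(100);
        }
        // Loop 3: wait until the corrupt replica has been invalidated.
        while (countReplicas(namesystem, blk).corruptReplicas() != 0) {
          Thread.sleep(100);
        }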

      The first loop waits for there to be exactly one live replica. The test expects this because it has just corrupted the replica on one of the datanodes (the test uses a replication factor of 2). One scenario in which this loop never exits: the Namenode invalidates the corrupt replica, schedules a replication command, and the newly copied replica is reported, all before the test gets a chance to observe a live-replica count of 1.

      The second loop waits for there to be two live replicas again. The test expects this to happen eventually: the first loop has exited, implying there is a single live replica, so it should only be a matter of time before the Namenode schedules a replication command to copy the replica to another datanode. One scenario in which this loop never exits is when the Namenode schedules the new replica on the very datanode whose replica the test corrupted. That destination datanode will not copy the block, complaining that it already has the (corrupt) replica in the RBW state. The result: the Namenode has scheduled a copy, the block sits in the Namenode's pending replication queue, and it is never removed from that queue because the Namenode never receives a report from a datanode that the block was added.

      Note: a block is moved from the 'pending replication' queue back to the 'needed replication' queue once the pending timeout (5 minutes by default) expires, and the Namenode then actively re-schedules replication for blocks in the 'needed replication' queue. This can eventually cause the second loop to exit, but only after more than 5 minutes.
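
      The 5-minute figure is the built-in pending-replication timeout. It is controlled by dfs.namenode.replication.pending.timeout-sec (DFSConfigKeys.DFS_NAMENODE_REPLICATION_PENDING_TIMEOUT_SEC_KEY; the default value of -1 means the 5-minute default applies), so a test can shrink it, e.g.:

        import org.apache.hadoop.conf.Configuration;
        import org.apache.hadoop.hdfs.DFSConfigKeys;
        import org.apache.hadoop.hdfs.HdfsConfiguration;

        Configuration conf = new HdfsConfiguration();
        // Time out stuck pending replications after 3 seconds instead of 5 minutes.
        conf.setInt(DFSConfigKeys.DFS_NAMENODE_REPLICATION_PENDING_TIMEOUT_SEC_KEY, 3);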

      The third loop waits for there to be no corrupt replicas. I don't see a scenario in which this loop can run forever: once the live replica count is back to normal (2), the corrupt replica will be invalidated and removed.

      I think increasing the heartbeat interval, so that the test has enough time to check the loop-1 condition before a datanode reports a successful copy, should avoid the race in loop 1. For loop 2, we can reduce the timeout after which a block is moved from the pending replication queue back to the needed replication queue. A sketch of both settings follows.
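
      A minimal sketch of a test setup with both mitigations applied (my reading of the proposal, not necessarily what the attached patches do; the concrete values are placeholders):

        import org.apache.hadoop.conf.Configuration;
        import org.apache.hadoop.hdfs.DFSConfigKeys;
        import org.apache.hadoop.hdfs.HdfsConfiguration;
        import org.apache.hadoop.hdfs.MiniDFSCluster;

        Configuration conf = new HdfsConfiguration();
        // Slow down heartbeats so the test can observe the one-live-replica
        // state before a datanode reports the re-replicated copy (loop 1).
        conf.setInt(DFSConfigKeys.DFS_HEARTBEAT_INTERVAL_KEY, 30);
        // Move blocks from 'pending replication' back to 'needed replication'
        // quickly, so a transfer rejected by the destination datanode is
        // retried on another node (loop 2).
        conf.setInt(DFSConfigKeys.DFS_NAMENODE_REPLICATION_PENDING_TIMEOUT_SEC_KEY, 3);
        MiniDFSCluster cluster = new MiniDFSCluster.Builder(conf)
            .numDataNodes(2).build();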

      Attachments

        1. HDFS-6681.patch (1 kB, Ratandeep Ratti)
        2. HDFS-6681.001.patch (1 kB, Zsolt Venczel)


          People

            Assignee: Ratandeep Ratti (rdsr)
            Reporter: Ratandeep Ratti (rdsr)
            Votes: 0
            Watchers: 7
