Hadoop HDFS / HDFS-6681 (sub-task of HDFS-15646: Track failing tests in HDFS)

TestRBWBlockInvalidation#testBlockInvalidationWhenRBWReplicaMissedInDN is flaky and sometimes gets stuck in infinite loops


Details

    • Type: Sub-task
    • Status: Patch Available
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 2.4.1
    • Fix Version/s: None
    • Component/s: None

    Description

      This test case contains three potentially infinite loops, each of which exits only when a certain condition is satisfied.
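
      For reference, the three waits have roughly the following shape (a paraphrase rather than the exact test source; countReplicas is assumed to be the test's helper around BlockManager#countNodes, and the loops run inside a test method declared with 'throws Exception'):

        import org.apache.hadoop.hdfs.protocol.ExtendedBlock;
        import org.apache.hadoop.hdfs.server.blockmanagement.NumberReplicas;
        import org.apache.hadoop.hdfs.server.namenode.FSNamesystem;

        // Assumed helper: ask the BlockManager how many live and corrupt
        // replicas of this block it currently knows about.
        private static NumberReplicas countReplicas(FSNamesystem ns, ExtendedBlock b) {
          return ns.getBlockManager().countNodes(b.getLocalBlock());
        }

        // Loop 1: wait until the corruption leaves exactly one live replica.
        while (countReplicas(namesystem, blk).liveReplicas() != 1) {
          Thread.sleep(100);
        }
        // Loop 2: wait until re-replication brings the count back to two.
        while (countReplicas(namesystem, blk).liveReplicas() != 2) {
          Thread.sleep(100);
        }
        // Loop 3: wait until the corrupt replica has been invalidated.
        while (countReplicas(namesystem, blk).corruptReplicas() != 0) {
          Thread.sleep(100);
        }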

      The first loop waits for there to be exactly one live replica. The test expects this because it has just corrupted the replica on one of the datanodes (the test uses a replication factor of 2). One scenario in which this loop never exits: the Namenode invalidates the corrupt replica, schedules a replication command, and the newly copied replica is reported, all before the test gets a chance to observe a live-replica count of 1.

      The second loop waits for there to be two live replicas again. The test expects this to happen eventually: the first loop has exited, implying there is a single live replica, so it should only be a matter of time before the Namenode schedules a replication command to copy the replica to another datanode. One scenario in which this loop never exits is when the Namenode schedules the new replica on the very datanode whose replica the test corrupted. That destination datanode will not copy the block, complaining that it already has the (corrupt) replica in the RBW state. The result: the Namenode has scheduled a copy, the block sits in the Namenode's pending replication queue, and it is never removed from that queue because the Namenode never receives a report from a datanode that the block was added.

      Note: a block is moved from the 'pending replication' queue back to the 'needed replication' queue once the pending timeout (5 minutes by default) expires, and the Namenode then actively re-schedules replication for blocks in the 'needed replication' queue. This can eventually cause the second loop to exit, but only after more than 5 minutes.
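
      The 5-minute figure is the built-in pending-replication timeout. It is controlled by dfs.namenode.replication.pending.timeout-sec (DFSConfigKeys.DFS_NAMENODE_REPLICATION_PENDING_TIMEOUT_SEC_KEY; the default value of -1 means the 5-minute default applies), so a test can shrink it, e.g.:

        import org.apache.hadoop.conf.Configuration;
        import org.apache.hadoop.hdfs.DFSConfigKeys;
        import org.apache.hadoop.hdfs.HdfsConfiguration;

        Configuration conf = new HdfsConfiguration();
        // Time out stuck pending replications after 3 seconds instead of 5 minutes.
        conf.setInt(DFSConfigKeys.DFS_NAMENODE_REPLICATION_PENDING_TIMEOUT_SEC_KEY, 3);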

      The third loop waits for there to be no corrupt replicas. I don't see a scenario in which this loop can run forever: once the live replica count is back to normal (2), the corrupt replica will be invalidated and removed.

      I think increasing the heartbeat interval, so that the test has enough time to check the loop-1 condition before a datanode reports a successful copy, should avoid the race in loop 1. For loop 2, we can reduce the timeout after which a block is moved from the pending replication queue back to the needed replication queue. A sketch of both settings follows.
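
      A minimal sketch of a test setup with both mitigations applied (my reading of the proposal, not necessarily what the attached patches do; the concrete values are placeholders):

        import org.apache.hadoop.conf.Configuration;
        import org.apache.hadoop.hdfs.DFSConfigKeys;
        import org.apache.hadoop.hdfs.HdfsConfiguration;
        import org.apache.hadoop.hdfs.MiniDFSCluster;

        Configuration conf = new HdfsConfiguration();
        // Slow down heartbeats so the test can observe the one-live-replica
        // state before a datanode reports the re-replicated copy (loop 1).
        conf.setInt(DFSConfigKeys.DFS_HEARTBEAT_INTERVAL_KEY, 30);
        // Move blocks from 'pending replication' back to 'needed replication'
        // quickly, so a transfer rejected by the destination datanode is
        // retried on another node (loop 2).
        conf.setInt(DFSConfigKeys.DFS_NAMENODE_REPLICATION_PENDING_TIMEOUT_SEC_KEY, 3);
        MiniDFSCluster cluster = new MiniDFSCluster.Builder(conf)
            .numDataNodes(2).build();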

      Attachments

        1. HDFS-6681.patch (1 kB, Ratandeep Ratti)
        2. HDFS-6681.001.patch (1 kB, Zsolt Venczel)


          People

            Assignee: Ratandeep Ratti (rdsr)
            Reporter: Ratandeep Ratti (rdsr)
            Votes: 0
            Watchers: 7
