Hadoop Common
  1. Hadoop Common
  2. HADOOP-5192

Block reciever should not remove a finalized block when block replication fails

    Details

    • Type: Bug Bug
    • Status: Closed
    • Priority: Blocker Blocker
    • Resolution: Fixed
    • Affects Version/s: 0.18.3
    • Fix Version/s: 0.18.4, 0.19.1
    • Component/s: None
    • Labels:
      None
    • Hadoop Flags:
      Reviewed

      Description

      HADOOP-4702 makes block receivers to remove the received block in case of a block replication failure. But the block should not be removed if the cause of the failure is that the block to be received already exists. The key is that a block receiver should allow to remove only partial blocks in case of block replication failures.

      1. 5192_20090407_0.18.patch
        5 kB
        Tsz Wo Nicholas Sze
      2. blockCleanup.patch
        5 kB
        Hairong Kuang
      3. blockCleanup1.patch
        5 kB
        Hairong Kuang
      4. blockCleanup1-br18.patch
        3 kB
        Hairong Kuang
      5. blockRemove.patch
        0.7 kB
        Hairong Kuang

        Issue Links

          Activity

          Hide
          Raghu Angadi added a comment -

          It does not look like it still correct yet.
          Say one thread is already writing a block 'B'.
          and there here is another request to write the same block, then the second thread will see that 'B' still not a "valid block" and delete the files!

          I think only the receiver thread that actually the owner/creator of the block should delete it.

          Show
          Raghu Angadi added a comment - It does not look like it still correct yet. Say one thread is already writing a block 'B'. and there here is another request to write the same block, then the second thread will see that 'B' still not a "valid block" and delete the files! I think only the receiver thread that actually the owner/creator of the block should delete it.
          Hide
          Raghu Angadi added a comment -

          Making this a "part of" HADOOP-4702. Jira needs a "follow up" type of "Linking"

          Show
          Raghu Angadi added a comment - Making this a "part of" HADOOP-4702 . Jira needs a "follow up" type of "Linking"
          Hide
          Hairong Kuang added a comment -

          Good catch Raghu! It seems that BlockReceiver should not remove the block in case of the following two exceptions thrown by FSDataset#writeToBlock
          1. throw new IOException("Block " + b + " is valid, and cannot be written to.");
          2. throw new IOException("Block " + b + " has already been started (though not completed), and thus cannot be created.");

          My patch handles only 1.

          Show
          Hairong Kuang added a comment - Good catch Raghu! It seems that BlockReceiver should not remove the block in case of the following two exceptions thrown by FSDataset#writeToBlock 1. throw new IOException("Block " + b + " is valid, and cannot be written to."); 2. throw new IOException("Block " + b + " has already been started (though not completed), and thus cannot be created."); My patch handles only 1.
          Hide
          Konstantin Shvachko added a comment -

          There were several issues related to failure of TestInjectionForSimulatedStorage: HADOOP-2412, HADOOP-3882, HADOOP-3900, HADOOP-4168.
          I've just linked them together. If this solves the problem could you please link this issues to the old ones and close them all together.

          Show
          Konstantin Shvachko added a comment - There were several issues related to failure of TestInjectionForSimulatedStorage: HADOOP-2412 , HADOOP-3882 , HADOOP-3900 , HADOOP-4168 . I've just linked them together. If this solves the problem could you please link this issues to the old ones and close them all together.
          Hide
          Hairong Kuang added a comment -

          I am not sure if this jira could fix the failure of TestInjectionForSimulatedStorage. HADOOP-4702 was committed to 0.18.3 recently. We saw the failure much earlier before HADOOP-4702 was committed.

          Show
          Hairong Kuang added a comment - I am not sure if this jira could fix the failure of TestInjectionForSimulatedStorage. HADOOP-4702 was committed to 0.18.3 recently. We saw the failure much earlier before HADOOP-4702 was committed.
          Hide
          Hairong Kuang added a comment -

          How about having a BlockAlreadyExistsException class which inherits IOException? BlockAlreadyExistsException is thrown in the above two cases.

          Show
          Hairong Kuang added a comment - How about having a BlockAlreadyExistsException class which inherits IOException? BlockAlreadyExistsException is thrown in the above two cases.
          Hide
          Hairong Kuang added a comment -

          This patch does not remove any block on BlockAlreadyExistsException during block replication.

          Show
          Hairong Kuang added a comment - This patch does not remove any block on BlockAlreadyExistsException during block replication.
          Hide
          Raghu Angadi added a comment -

          I don't think it is still correct. 'isValidBlock()' check and 'unfinalizeBlock()' are not done under the same instance of lock. The block could become valid between those.

          Show
          Raghu Angadi added a comment - I don't think it is still correct. 'isValidBlock()' check and 'unfinalizeBlock()' are not done under the same instance of lock. The block could become valid between those.
          Hide
          Raghu Angadi added a comment -

          Actually, addition of 'BlockAlreadyExistsException' does not actually improve upon the earlier patch (since '!isValidBlock()' would be false in that case any way).

          Show
          Raghu Angadi added a comment - Actually, addition of 'BlockAlreadyExistsException' does not actually improve upon the earlier patch (since '!isValidBlock()' would be false in that case any way).
          Hide
          Raghu Angadi added a comment -

          > addition of 'BlockAlreadyExistsException' does not actually improve upon the earlier patch

          oh.. this exception is thrown both cases (1. when the block exists, and 2. when the block is being written). From the name of the class, I thought it was thrown only for (1).

          In order fix the locking, you could introduce 'unfinalizeBlockIfInvalid()' that does the check and deletion under one lock.

          Show
          Raghu Angadi added a comment - > addition of 'BlockAlreadyExistsException' does not actually improve upon the earlier patch oh.. this exception is thrown both cases (1. when the block exists, and 2. when the block is being written). From the name of the class, I thought it was thrown only for (1). In order fix the locking, you could introduce 'unfinalizeBlockIfInvalid()' that does the check and deletion under one lock.
          Hide
          Hairong Kuang added a comment -

          > In order fix the locking, you could introduce 'unfinalizeBlockIfInvalid()' that does the check and deletion under one lock.
          I do not think this is necessary. When a block is in the unfinalized state under the tmp directory, no other thread can move the partial block to the finalized state. Adding the lock 'unfinalizeBlockIfInvalid() is just being super cautious. If we assume that there could be other threads working on the partial block concurrently, even with the lock, unfinalizeBlockIfInvalid is not safe at all because it may remove data that other threads are writing to.

          Show
          Hairong Kuang added a comment - > In order fix the locking, you could introduce 'unfinalizeBlockIfInvalid()' that does the check and deletion under one lock. I do not think this is necessary. When a block is in the unfinalized state under the tmp directory, no other thread can move the partial block to the finalized state. Adding the lock 'unfinalizeBlockIfInvalid() is just being super cautious. If we assume that there could be other threads working on the partial block concurrently, even with the lock, unfinalizeBlockIfInvalid is not safe at all because it may remove data that other threads are writing to.
          Hide
          Raghu Angadi added a comment -

          > Adding the lock 'unfinalizeBlockIfInvalid() is just being super cautious.

          umm.. I would call that 'good practice'. between :

          • {{'lock(p); bool cond = condition(); unlock(p); if (cond) {lock (p); destroy(); unlock(p) ;}

            '}} and

          • 'lock (p); if (condition()) destroy(); unlock(p);'

          to me the former looks instinctively error prone (in many cases would really be a bug).. and it would be hard track in future since others wouldn't always look in this corner. There is not even comment why the multiple locking is correct. In this case you are mostly right.. only based on global knowledge of DataNode. Any one reading the code just around that, wouldn't easily see that.

          w.r.t deletion, it is better to err on not deleting than to err on deletion... if I am not mistaken, that is essence of the problem with HADOOP-4702, which this jira proposes to fix.

          Anyway, it is not a -1 from me.

          Show
          Raghu Angadi added a comment - > Adding the lock 'unfinalizeBlockIfInvalid() is just being super cautious. umm.. I would call that 'good practice'. between : {{'lock(p); bool cond = condition(); unlock(p); if (cond) {lock (p); destroy(); unlock(p) ;} '}} and 'lock (p); if (condition()) destroy(); unlock(p);' to me the former looks instinctively error prone (in many cases would really be a bug).. and it would be hard track in future since others wouldn't always look in this corner. There is not even comment why the multiple locking is correct. In this case you are mostly right.. only based on global knowledge of DataNode. Any one reading the code just around that, wouldn't easily see that. w.r.t deletion, it is better to err on not deleting than to err on deletion... if I am not mistaken, that is essence of the problem with HADOOP-4702 , which this jira proposes to fix. Anyway, it is not a -1 from me.
          Hide
          Hairong Kuang added a comment -

          The essence of the problem with HADOOP-4702 is that it gives a chance for a thread to delete a block which is not created by itself. BlockAlreadyExistedException prevents this from happening. With the validity check, it allows the writer to remove a block that it is writing to if the block is not finalized when error occurs. I agree what you proposed is a good practice, but cleanupBlock does not aim at a multi-threads situation.

          Show
          Hairong Kuang added a comment - The essence of the problem with HADOOP-4702 is that it gives a chance for a thread to delete a block which is not created by itself. BlockAlreadyExistedException prevents this from happening. With the validity check, it allows the writer to remove a block that it is writing to if the block is not finalized when error occurs. I agree what you proposed is a good practice, but cleanupBlock does not aim at a multi-threads situation.
          Hide
          Tsz Wo Nicholas Sze added a comment -

          +1 patch looks good to me except it needs to add serialVersionUID to the new exception class.

          In our case, if a block is invalid, it remains invalid forever. So we may not need a lock.

          Show
          Tsz Wo Nicholas Sze added a comment - +1 patch looks good to me except it needs to add serialVersionUID to the new exception class. In our case, if a block is invalid, it remains invalid forever. So we may not need a lock.
          Hide
          Raghu Angadi added a comment -

          > In our case, if a block is invalid, it remains invalid forever. So we may not need a lock.

          Every block becomes valid once.. i.e. it is invalid before that instant. The above statement is not correct I think.

          Show
          Raghu Angadi added a comment - > In our case, if a block is invalid, it remains invalid forever. So we may not need a lock. Every block becomes valid once.. i.e. it is invalid before that instant. The above statement is not correct I think.
          Hide
          Tsz Wo Nicholas Sze added a comment -

          > In our case, if a block is invalid, it remains invalid forever. So we may not need a lock.

          Every block becomes valid once.. i.e. it is invalid before that instant. The above statement is not correct I think.

          You are right. Hairong also pointed this out. I think we could check whether the block is valid inside unfinalizeBlock(..).

          Show
          Tsz Wo Nicholas Sze added a comment - > In our case, if a block is invalid, it remains invalid forever. So we may not need a lock. Every block becomes valid once.. i.e. it is invalid before that instant. The above statement is not correct I think. You are right. Hairong also pointed this out. I think we could check whether the block is valid inside unfinalizeBlock(..).
          Hide
          Hairong Kuang added a comment -

          It turns out that method unfinalizeBlock does not remove finalized block. So there is no need for validity check and locking is not an issue either.

          Show
          Hairong Kuang added a comment - It turns out that method unfinalizeBlock does not remove finalized block. So there is no need for validity check and locking is not an issue either.
          Hide
          Hairong Kuang added a comment -

          Here is the patch that removes validity check in cleanupBlock and adds serialVersionUID to BlocksAlreadyExistsException.

          Show
          Hairong Kuang added a comment - Here is the patch that removes validity check in cleanupBlock and adds serialVersionUID to BlocksAlreadyExistsException.
          Hide
          Hairong Kuang added a comment -

          [exec] +1 overall.
          [exec]
          [exec] +1 @author. The patch does not contain any @author tags.
          [exec]
          [exec] +1 tests included. The patch appears to include 3 new or modified tests.
          [exec]
          [exec] +1 javadoc. The javadoc tool did not generate any warning messages.
          [exec]
          [exec] +1 javac. The applied patch does not increase the total number of javac compiler warnings.
          [exec]
          [exec] +1 findbugs. The patch does not introduce any new Findbugs warnings.
          [exec]
          [exec] +1 Eclipse classpath. The patch retains Eclipse classpath integrity.
          [exec]
          [exec] +1 release audit. The applied patch does not increase the total number of release audit warnings.

          ant test-core passed all unit tests except for TestMapReduceLocal.

          Show
          Hairong Kuang added a comment - [exec] +1 overall. [exec] [exec] +1 @author. The patch does not contain any @author tags. [exec] [exec] +1 tests included. The patch appears to include 3 new or modified tests. [exec] [exec] +1 javadoc. The javadoc tool did not generate any warning messages. [exec] [exec] +1 javac. The applied patch does not increase the total number of javac compiler warnings. [exec] [exec] +1 findbugs. The patch does not introduce any new Findbugs warnings. [exec] [exec] +1 Eclipse classpath. The patch retains Eclipse classpath integrity. [exec] [exec] +1 release audit. The applied patch does not increase the total number of release audit warnings. ant test-core passed all unit tests except for TestMapReduceLocal.
          Hide
          Tsz Wo Nicholas Sze added a comment -

          +1

          Show
          Tsz Wo Nicholas Sze added a comment - +1
          Hide
          Hairong Kuang added a comment -

          I've committed this.

          Show
          Hairong Kuang added a comment - I've committed this.
          Hide
          Hudson added a comment -
          Show
          Hudson added a comment - Integrated in Hadoop-trunk #756 (See http://hudson.zones.apache.org/hudson/job/Hadoop-trunk/756/ )
          Hide
          Hairong Kuang added a comment -

          Attach the patch for 0.18.

          Show
          Hairong Kuang added a comment - Attach the patch for 0.18.
          Hide
          Tsz Wo Nicholas Sze added a comment - - edited

          blockCleanup1-br18.patch does not include BlockAlreadyExistsException.java

          5192_20090407_0.18.patch: added BlockAlreadyExistsException.java

          Show
          Tsz Wo Nicholas Sze added a comment - - edited blockCleanup1-br18.patch does not include BlockAlreadyExistsException.java 5192_20090407_0.18.patch: added BlockAlreadyExistsException.java

            People

            • Assignee:
              Hairong Kuang
              Reporter:
              Hairong Kuang
            • Votes:
              0 Vote for this issue
              Watchers:
              3 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development