Hadoop HDFS > HDFS-3399 BookKeeper option support for NN HA > HDFS-3452

BKJM:Switch from standby to active fails and NN gets shut down due to delay in clearing of lock

    Details

    • Type: Sub-task
    • Status: Closed
    • Priority: Blocker
    • Resolution: Fixed
    • Affects Version/s: 2.0.0-alpha
    • Fix Version/s: 3.0.0, 2.0.2-alpha
    • Component/s: None
    • Labels: None
    • Target Version/s:
    • Hadoop Flags: Reviewed

      Description

      A normal switch fails.
      (The BookKeeperJournalManager ZK session timeout is 3000 ms and the ZKFC session timeout is 5000 ms. By the time control reaches lock acquisition, the previous lock has not yet been released, which causes lock acquisition by the NN to fail and the NN to shut down. Ideally the switch should have completed.)
      =============================================================================
      2012-05-09 20:15:29,732 ERROR org.apache.hadoop.contrib.bkjournal.WriteLock: Failed to acquire lock with /ledgers/lock/lock-0000000007, lock-0000000006 already has it
      2012-05-09 20:15:29,732 FATAL org.apache.hadoop.hdfs.server.namenode.FSEditLog: Error: recoverUnfinalizedSegments failed for required journal (JournalAndStream(mgr=org.apache.hadoop.contrib.bkjournal.BookKeeperJournalManager@412beeec, stream=null))
      java.io.IOException: Could not acquire lock
      at org.apache.hadoop.contrib.bkjournal.WriteLock.acquire(WriteLock.java:107)
      at org.apache.hadoop.contrib.bkjournal.BookKeeperJournalManager.recoverUnfinalizedSegments(BookKeeperJournalManager.java:406)
      at org.apache.hadoop.hdfs.server.namenode.JournalSet$6.apply(JournalSet.java:551)
      at org.apache.hadoop.hdfs.server.namenode.JournalSet.mapJournalsAndReportErrors(JournalSet.java:322)
      at org.apache.hadoop.hdfs.server.namenode.JournalSet.recoverUnfinalizedSegments(JournalSet.java:548)
      at org.apache.hadoop.hdfs.server.namenode.FSEditLog.recoverUnclosedStreams(FSEditLog.java:1134)
      at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startActiveServices(FSNamesystem.java:598)
      at org.apache.hadoop.hdfs.server.namenode.NameNode$NameNodeHAContext.startActiveServices(NameNode.java:1287)
      at org.apache.hadoop.hdfs.server.namenode.ha.ActiveState.enterState(ActiveState.java:61)
      at org.apache.hadoop.hdfs.server.namenode.ha.HAState.setStateInternal(HAState.java:63)
      at org.apache.hadoop.hdfs.server.namenode.ha.StandbyState.setState(StandbyState.java:49)
      at org.apache.hadoop.hdfs.server.namenode.NameNode.transitionToActive(NameNode.java:1219)
      at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.transitionToActive(NameNodeRpcServer.java:978)
      at org.apache.hadoop.ha.protocolPB.HAServiceProtocolServerSideTranslatorPB.transitionToActive(HAServiceProtocolServerSideTranslatorPB.java:107)
      at org.apache.hadoop.ha.proto.HAServiceProtocolProtos$HAServiceProtocolService$2.callBlockingMethod(HAServiceProtocolProtos.java:3633)
      at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:427)
      at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:916)
      at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1692)
      at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1688)
      at java.security.AccessController.doPrivileged(Native Method)
      at javax.security.auth.Subject.doAs(Subject.java:396)
      at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1232)
      at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1686)
      2012-05-09 20:15:29,736 INFO org.apache.hadoop.hdfs.server.namenode.NameNode: SHUTDOWN_MSG:
      /************************************************************
      SHUTDOWN_MSG: Shutting down NameNode at HOST-XX-XX-XX-XX/XX.XX.XX.XX

      Scenario:
      Start the ZKFCs and NNs.
      NN1 is active and NN2 is standby.
      Stop NN1. NN2 tries to transition to active and gets shut down.

      Attachments

      1. BK-253-BKJM.patch
        17 kB
        Uma Maheswara Rao G
      2. HDFS-3452.patch
        25 kB
        Uma Maheswara Rao G
      3. HDFS-3452.patch
        25 kB
        Uma Maheswara Rao G
      4. HDFS-3452-1.patch
        28 kB
        Uma Maheswara Rao G
      5. HDFS-3452-2.patch
        28 kB
        Uma Maheswara Rao G

          Activity

          Rakesh R added a comment -

          I think the problem here is that BKJM doesn't have logic to release the write lock when the active NN is shut down. Also, I have a couple of observations that I hope will be useful.

          1. The ZK client session timeout is hard-coded and, I feel, should be configurable (see the sketch after this list).
            ZooKeeper(zkConnect, 3000, new ZkConnectionWatcher()); // 3000 as default.
          2. BKJM's write lock (a ZK distributed lock) has a different session timeout from the ZKFC's.
            This can cause many inconsistencies, as described in this defect. IMO, BKJM should either share the ZKFC's lock as its write lock, or ensure the BKJM ZK client uses the same ZK cluster with a session timeout smaller than the ZKFC's.
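
          A minimal sketch of how the hard-coded timeout could be made configurable; the configuration key, default and factory class below are illustrative assumptions, not actual BKJM names:

          import org.apache.hadoop.conf.Configuration;
          import org.apache.zookeeper.Watcher;
          import org.apache.zookeeper.ZooKeeper;

          public class BkjmZkClientFactory {
            // Hypothetical key; the real BKJM configuration name may differ.
            static final String ZK_SESSION_TIMEOUT_KEY =
                "dfs.namenode.bookkeeperjournal.zk.session.timeout";
            static final int ZK_SESSION_TIMEOUT_DEFAULT = 3000; // current hard-coded value

            static ZooKeeper connect(Configuration conf, String zkConnect, Watcher watcher)
                throws java.io.IOException {
              int sessionTimeout = conf.getInt(ZK_SESSION_TIMEOUT_KEY, ZK_SESSION_TIMEOUT_DEFAULT);
              return new ZooKeeper(zkConnect, sessionTimeout, watcher);
            }
          }
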
          Uma Maheswara Rao G added a comment -

          Here I am thinking of two scenarios that may create this problem:
          a)
          1) ZKFC1 and the active NN are running on one machine.
          2) ZKFC2 and the standby NN are running on another machine.
          3) Now ZKFC1 gets killed. ZKFC2 may then get a notification to switch to active and fence the previous active node. As part of fencing, it may kill -9 the current active NN and after that switch the current standby NN to active. But the previous NN (BKJM) might not clean up the lock immediately, because it will wait for the session timeout (3000 ms).
          4) So, the switch may fail because the other NN's lock has not been released yet.

          Should we stop BKJM gracefully (BKJM should ensure its lock is cleaned up on stop instead of waiting for the session timeout) rather than kill -9 as part of fencing?

          b)
          1) ZKFC1 and the active NN are running on one machine.
          2) ZKFC2 and the standby NN are running on another machine.
          3) Stop the current NN gracefully. Here BKJM will just close its ZK client, but the lock may not be released immediately (it waits for the session timeout of 3000 ms).
          4) Now ZKFC1 will detect the active node's state, find that it is down, and quit the leader election.
          ZKFC2 may try to switch the other node immediately because of the node-deletion notification (before the 3000 ms elapse).
          5) So, the switch may fail again here, because the other node might not have released the lock yet.

          Yes, I feel that having the same locking mechanism as the ZKFC, with the same path, can solve these problems.

          1.zkclient session timeout is hard coded and I feel, should be configurable.
          ZooKeeper(zkConnect, 3000, new ZkConnectionWatcher()); //3000 as default.

          Good to make it configurable.

          Uma Maheswara Rao G added a comment -

          Ivan, what's your suggestion on having a common (for ZKFC and BKJM) ZK node to hold the lock?

          Uma Maheswara Rao G added a comment -

          Marking it as a blocker, as a basic switch will not work with automatic failover.

          Uma Maheswara Rao G added a comment -

          One proposal for addressing this issue:

          ZKFC1 and NN1 on one machine.
          ZKFC2 and NN2 on another machine.

          NN1 is acting as ACTIVE and NN2 is STANDBY.
          Somehow (due to a network partition) NN2 gets the notification to become active, but NN1 is also still acting as ACTIVE.

          Solution for fencing:
          When NN2 becomes active, it should be able to acquire the lock and clean up the other node's locks.
          The actual problem here is that NN1 may still write data to the bookies. To solve this, when NN1 is recovering the ledger, it should also check that its own lock node is present before acting as active. If the lock node is not there, it is no longer in the active role. A standby will not create any lock nodes in ZK.
          With this proposal we cannot really call them locks; it is more like a write permission.

          Thoughts?

          Ivan Kelly added a comment -

          NN1 should release the lock when it finalizes its current segment. FSEditLog#close calls endCurrentLogSegment, which calls finalizeSegment on the journalSet.

          There was another JIRA, HDFS-3386, about a similar issue. Perhaps what you are seeing is another manifestation of that.

          Ivan Kelly added a comment -

          Also, I agree with harmonising with the ZKFC. If the ZKFC is being used, we should warn if the configured timeout is higher than ZKFC timeout. If it is not configured, we should default to 90% of ZKFC timeout, or so.

          Uma Maheswara Rao G added a comment -

          Hi Ivan, Thanks a lot for taking a look.

          The NN1 should release the lock when it finalizes it's current segment. FSEditLog#close calls endCurrentLogSegment which calls finalizeSegment on the journalSet.

          The point I am addressing is: before NN1 even calls finalizeSegment, NN2 can become active and will get shut down, because the lock has not been released by its peer node.

          There was another JIRA, HDFS-3386 about a similar issue. Perhaps what you are seeing is another manifestation of that.

          Yes, that is similar but a different issue. With our proposed fix, that one should also get addressed.

          Also, I agree with harmonising with the ZKFC. If the ZKFC is being used, we should warn if the configured timeout is higher than ZKFC timeout. If it is not configured, we should default to 90% of ZKFC timeout, or so.

          But the problem is that the ZKFC and the NN are different processes. It is not necessarily true that the ZKFC configuration will also be available to the NN. So the only option here is to document it.

          Ivan Kelly added a comment -

          Ah yes, I had misunderstood the problem. I think the write "permission" node will work, but it needs a small modification to ensure that, in the time period between deleting and acquiring the write "permission" and creating and using the ledger, another node doesn't come in and do the same. I think it should work as follows.

          There is one znode, the write permission znode, /journal/writeLock
          When a node wants to start writing, it must read the znode to see what the current inprogress_znode is. At this point it saves the version of the writeLock znode. It then recovers the inprogress_znode, which will fence the ledger which it is using. It creates its own ledger, and then writes the new inprogress_znode to writeLock, using the version it previously saved.
          If another node has tried to start writing before this, the version will have changed, so the write will fail.
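
          A minimal sketch of the version check described above, using the plain ZooKeeper client API (this is an editor's illustration, not the eventual BKJM code; the data encoding and the omitted recovery and ledger-creation steps are placeholders):

          import org.apache.zookeeper.KeeperException;
          import org.apache.zookeeper.ZooKeeper;
          import org.apache.zookeeper.data.Stat;

          public class WritePermissionSketch {
            private static final String WRITE_LOCK_PATH = "/journal/writeLock";

            /** Returns true if this node won the right to write the new segment. */
            static boolean startWriting(ZooKeeper zk, String newInprogressZnode)
                throws KeeperException, InterruptedException {
              // 1. Read the current content (the path of the inprogress znode, or empty)
              //    and remember the znode version (V).
              Stat stat = new Stat();
              zk.getData(WRITE_LOCK_PATH, false, stat);
              int savedVersion = stat.getVersion();

              // 2. Recover the inprogress ledger read in step 1; BookKeeper fencing stops
              //    the previous writer. (Omitted here.)
              // 3. Create the new ledger and its inprogress znode. (Omitted here.)

              // 4. Publish the new inprogress path only if nobody has updated the znode
              //    since step 1; a concurrent writer bumps the version, so this
              //    conditional setData fails with BadVersionException.
              try {
                zk.setData(WRITE_LOCK_PATH, newInprogressZnode.getBytes(), savedVersion);
                return true;
              } catch (KeeperException.BadVersionException e) {
                return false; // another node started writing first
              }
            }
          }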

          Rakesh R added a comment -

          @Ivan

          but it needs a small modification to ensure that in the time period between deleting and acquiring the write "permission" and creating the using the ledger, and other node doesn't come in and do the same

          I hope you are pointing to the window between the 'delete & create' operations and the chance of a race condition there.

          Can we use the ZooKeeper MultiTransactionRecord API, something like:
          Op.delete("delete",
          Op.create("create",
          zk.multi(ops);

          I feel this would resolve the race condition. What's your opinion?

          Also, I didn't fully understand the versioning concept you are proposing.
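
          For reference, a complete multi() call of the kind sketched above might look like the following (the paths, data and helper class are purely illustrative); as noted in the follow-up below, this only helps if the delete and the create really can be issued as one atomic call:

          import java.util.Arrays;
          import org.apache.zookeeper.CreateMode;
          import org.apache.zookeeper.KeeperException;
          import org.apache.zookeeper.Op;
          import org.apache.zookeeper.ZooDefs;
          import org.apache.zookeeper.ZooKeeper;

          public class MultiOpSketch {
            /** Atomically replace the old write-permission znode with a new one. */
            static void replacePermission(ZooKeeper zk, String path, byte[] newData)
                throws KeeperException, InterruptedException {
              // Both operations succeed or fail together, so there is no window in which
              // the znode is deleted but not yet recreated.
              zk.multi(Arrays.asList(
                  Op.delete(path, -1), // -1 matches any version
                  Op.create(path, newData, ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT)));
            }
          }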

          Rakesh R added a comment -

          @Ivan,

          Oh, you meant that recovering the inprogress_znode will release the write permission and startLogSegment will again try to acquire the write permission. In that case, we cannot go with the multi() option, since these are two different calls. I also feel the logic based on the znode version would work.

          Ivan Kelly added a comment -

          Yes, that's exactly what I mean. I've been trying to formulate a possible race for this for the last few hours, but I haven't been able to. Once I come up with one, I'll post it here.

          Ivan Kelly added a comment -

          If the race doesn't exist, it would be possible to simply 'lock' using the inprogress znode.

          Uma Maheswara Rao G added a comment -

          @Ivan,

          There is one znode, the write permission znode, /journal/writeLock
          When a node wants to start writing, it must read the znode to see what the current inprogress_znode is. At this point it saves the version of the writeLock znode. It then recovers the inprogress_znode, which will fence the ledger which it is using. It creates its own ledger, and then writes the new inprogress_znode to writeLock, using the version it previously saved.
          If another node has tried to start writing before this, the version will have changed, so the write will fail.

          I am not sure I followed you correctly.
          This is what I understood:
          When NN2 tries to become active while NN1 is already acting as active, it will see a new version id in ZK and do the ledger recoveries. Finally, it does the comparison check against its saved version id before proceeding to write.

          In between, if NN1 is also recovering and creating a new ledger, it may have a different version id; since the version id may have been changed by NN2 after recovery, NN1's write will fail.

          Ivan Kelly added a comment -

          @Uma
          This is what I was suggesting.

          Uma Maheswara Rao G added a comment -

          @Ivan, I have updated a very basic patch. I did not include the tests in this patch; it is just for checking the approach.
          I will include the tests in the next version of the patch.
          I could not find any reason for adding the permission lock in recoverUnfinalizedSegments. Actually, we can do this version check while creating the ledger itself. In startLogSegment, we set the permission data and get the version number. After creating the ledger, we check the permission's version number by setting the data again with the previously saved version number, as you proposed.
          We have verified this with ZKFC and manual failover modes; it is working well. I am still trying to find gaps. I have uploaded this basic version of the patch so you can provide feedback on the approach.

          Thanks
          Uma

          Uma Maheswara Rao G added a comment -

          BTW, do you mind moving it to under HDFS-3399?

          Ivan Kelly added a comment -

          Moved to HDFS.

          I've thought about this a bit more. Your patch is a good start, but it actually does more than we need in some parts. Really, the purpose of the locking is to ensure that we do not add new entries without having read all previous entries. Locking on the creation of inprogress znodes should be enough to ensure this. Fencing should take care of any other cases.

          startLogSegment should work as follows (see the sketch at the end of this comment):

          1. get version(V) and content(C) of writePermissions znode. C is the path to an inprogress znode (Z1), or null
          2. if Z1 exists, throw an exception. Otherwise proceed.
          3. create inprogress znode(Z2) and ledger.
          4. write writePermissions znode with Z2 and V.

          finalizeLogSegment should read writePermissions znode and null it if content matches the inprogress znode it is finalizing.

          So,
          a) I think WritePermission should be called something more like CurrentInprogress.
          b) The interface should be something like

          public class CurrentInprogress {
              String readCurrent(); // returns current znode or null
              void updateCurrent(String path) throws Exception;
              void clearCurrent();
          }
          

          c) This only ever needs to be used in startLogSegment. #clearCurrent is really optional, but there for completeness.
          d) #checkPermission is unnecessary. If something else has opened another inprogress znode while we are writing, it should have closed the ledger we were writing to, thereby fencing it, thereby stopping any further writes.
          e) The actual data stored in the znode should include a version number, a hostname and then the path. This will make debugging easier.
          f) You have some tabs.
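
          A minimal sketch of how startLogSegment might use such a class, following steps 1-4 above; the interface is reproduced from the proposal (method names follow the later comments), and the paths and exception handling are placeholders rather than the committed code:

          import java.io.IOException;

          public class StartLogSegmentSketch {

            /** The interface proposed above, with the shortened method names. */
            interface CurrentInprogress {
              String read() throws IOException;            // current inprogress znode, or null
              void update(String path) throws IOException; // conditional write; fails if raced
              void clear() throws IOException;
            }

            static void startLogSegment(CurrentInprogress ci, long txId) throws IOException {
              // Steps 1-2: if another inprogress znode is still registered, a previous
              // writer never finalized (or is still writing), so refuse to start.
              String existing = ci.read();
              if (existing != null) {
                throw new IOException("Inprogress node already exists: " + existing);
              }

              // Step 3: create the new ledger and its inprogress znode here.
              // (Omitted; the path below is only a placeholder.)
              String inprogressZnode = "/ledgers/inprogress_" + txId;

              // Step 4: record the new inprogress znode. update() is expected to do a
              // conditional setData with the znode version saved by read(), so a
              // concurrent writer makes this call fail instead of two nodes writing
              // to the journal at once.
              ci.update(inprogressZnode);
            }
          }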

          Uma Maheswara Rao G added a comment -

          Thanks, Ivan, for your feedback. In fact, I was initially thinking about using the inprogress file for this locking. I like this approach.

          public class CurrentInprogress {
              String readCurrent(); // returns current znode or null
              void updateCurrent(String path) throws Exception;
              void clearCurrent();
          }
          

          I will just use 'read', 'update' and 'clear' instead of appending 'Current', because we already use it in the class name itself. I will most likely upload the final patch tomorrow.

          Uma Maheswara Rao G added a comment -

          Attached a patch addressing Ivan's feedback on the initial proposal.
          Added tests for the CurrentInprogress file and verified the integration with the BKJM multiple-writer tests.
          Also verified manually; all basic flows worked fine.

          Note that this patch applies on top of HDFS-3058.

          Uma Maheswara Rao G added a comment -

          Rebased the patch based on latest trunk and HDFS-3058 changes.

          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12529704/HDFS-3452.patch
          against trunk revision .

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 1 new or modified test files.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 eclipse:eclipse. The patch built with eclipse:eclipse.

          +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          -1 core tests. The patch failed these unit tests in hadoop-hdfs-project/hadoop-hdfs/src/contrib/bkjournal:

          org.apache.hadoop.contrib.bkjournal.TestBookKeeperAsHASharedDir
          org.apache.hadoop.contrib.bkjournal.TestBookKeeperJournalManager

          +1 contrib tests. The patch passed contrib unit tests.

          Test results: https://builds.apache.org/job/PreCommit-HDFS-Build/2519//testReport/
          Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/2519//console

          This message is automatically generated.

          Ivan Kelly added a comment -

          Patch looks good Uma. A few comments.

          1. The version in the data should be a format version (not the znode version), in case we wish to change the data format in the future.
          2. The creation of the inprogress znode should catch a NodeExists exception, in case two nodes start at once (see the sketch at the end of this comment).
          3. In javadoc, @update should be #update.
          4. "Already inprogress node exists" -> "Inprogress node already exists"
          5. TestBookKeeperJournalManager#testAllBookieFailure: you need to add bkjm.recoverUnfinalizedSegments() before the failing startLogSegment.
          6. TestBookKeeperAsHASharedDir#testMultiplePrimariesStarted: this needs to be changed. The fix is simple though; now that the locking has changed, it is the NN that was previously writing which dies, not the new one trying to start. Code below:
            @Test
            public void testMultiplePrimariesStarted() throws Exception {
              Runtime mockRuntime1 = mock(Runtime.class);
              Runtime mockRuntime2 = mock(Runtime.class);
              Path p1 = new Path("/testBKJMMultiplePrimary");
              Path p2 = new Path("/testBKJMMultiplePrimary2");
          
              MiniDFSCluster cluster = null;
              try {
                Configuration conf = new Configuration();
                conf.setInt(DFSConfigKeys.DFS_HA_TAILEDITS_PERIOD_KEY, 1);
                conf.set(DFSConfigKeys.DFS_NAMENODE_SHARED_EDITS_DIR_KEY,
                         BKJMUtil.createJournalURI("/hotfailoverMultiple").toString());
                BKJMUtil.addJournalManagerDefinition(conf);
          
                cluster = new MiniDFSCluster.Builder(conf)
                  .nnTopology(MiniDFSNNTopology.simpleHATopology())
                  .numDataNodes(0)
                  .manageNameDfsSharedDirs(false)
                  .build();
                NameNode nn1 = cluster.getNameNode(0);
                NameNode nn2 = cluster.getNameNode(1);
                FSEditLogTestUtil.setRuntimeForEditLog(nn1, mockRuntime1);
                FSEditLogTestUtil.setRuntimeForEditLog(nn2, mockRuntime2);
                cluster.waitActive();
                cluster.transitionToActive(0);
          
                FileSystem fs = HATestUtil.configureFailoverFs(cluster, conf);
                fs.mkdirs(p1);
                nn1.getRpcServer().rollEditLog();
                cluster.transitionToActive(1);
          
                verify(mockRuntime1, times(0)).exit(anyInt());
                fs.mkdirs(p2);
          
                verify(mockRuntime1, atLeastOnce()).exit(anyInt());
                verify(mockRuntime2, times(0)).exit(anyInt());
          
              } finally {
                if (cluster != null) {
                  cluster.shutdown();
                }
              }
            }
          

          Other than that, I think this is ready to go. Good work
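
          For point 2 above, the intended behaviour is roughly the following; the path, data and error messages here are illustrative, and the actual patch may differ:

          import java.io.IOException;
          import org.apache.zookeeper.CreateMode;
          import org.apache.zookeeper.KeeperException;
          import org.apache.zookeeper.ZooDefs;
          import org.apache.zookeeper.ZooKeeper;

          public class CreateInprogressSketch {
            /** Create the inprogress znode, failing cleanly if a concurrent writer won. */
            static void createInprogressZnode(ZooKeeper zk, String path, byte[] data)
                throws IOException {
              try {
                zk.create(path, data, ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
              } catch (KeeperException.NodeExistsException e) {
                // Two namenodes started a segment at the same time; the loser must not
                // keep writing, so surface the conflict instead of failing later.
                throw new IOException("Inprogress node already exists: " + path, e);
              } catch (KeeperException e) {
                throw new IOException("Error creating inprogress znode " + path, e);
              } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IOException("Interrupted creating inprogress znode " + path, e);
              }
            }
          }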

          Uma Maheswara Rao G added a comment -

          Thanks a lot, Ivan, for the review!

          I have already corrected the tests, with fixes similar to what you suggested here:

          5. TestBookKeeperJournalManager#testAllBookieFailure: you need to add bkjm.recoverUnfinalizedSegments() before the failing startLogSegment.
          6. TestBookKeeperAsHASharedDir#testMultiplePrimariesStarted: this needs to be changed.

          I will address the remaining comments and post the patch in some time.

          Uma Maheswara Rao G added a comment -

          1.The version in the data should be a format version, in case we wish to change the data format in future. Not the znode version

          Oh, since you mentioned that this is for debugging, I thought it was the znode version. Now I am using a format version number; I am assuming this is the version number you are referring to here.

          2.the creation of the inprogress znode should catch a nodeexists exception in the case to two nodes starting at once

          Done. Good catch; somehow I missed it.

          3.in javadoc, @update should be #update

          done.

          4."Already inprogress node exists" -> "Inprogress node already exists"

          done

          5.TestBookKeeperJournalManager#testAllBookieFailure: you need to add bkjm.recoverUnfinalizedSegments() before the failing startLogSegment.
          6.TestBookKeeperAsHASharedDir#testMultiplePrimariesStarted:

          done.

          Thanks a lot, Ivan for your reviews.

          Uma Maheswara Rao G added a comment -

          All tests passed:
          -------------------------------------------------------
          T E S T S
          -------------------------------------------------------
          Running org.apache.hadoop.contrib.bkjournal.TestBookKeeperAsHASharedDir
          Tests run: 3, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 33.384 sec
          Running org.apache.hadoop.contrib.bkjournal.TestBookKeeperHACheckpoints
          Running org.apache.hadoop.contrib.bkjournal.TestBookKeeperJournalManager
          Tests run: 10, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 10.364 sec
          Running org.apache.hadoop.contrib.bkjournal.TestCurrentInprogress
          Tests run: 3, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 1.149 sec

          Results :

          Tests run: 16, Failures: 0, Errors: 0, Skipped: 0

          Hadoop QA added a comment -

          +1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12529848/HDFS-3452-1.patch
          against trunk revision .

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 3 new or modified test files.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 eclipse:eclipse. The patch built with eclipse:eclipse.

          +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          +1 core tests. The patch passed unit tests in hadoop-hdfs-project/hadoop-hdfs/src/contrib/bkjournal.

          +1 contrib tests. The patch passed contrib unit tests.

          Test results: https://builds.apache.org/job/PreCommit-HDFS-Build/2521//testReport/
          Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/2521//console

          This message is automatically generated.

          Ivan Kelly added a comment -

          The new patch looks almost ready to go. I have a couple of comments though.

          1. in finalizeLogSegment, ci.clear() is in a finally block. This means that the currentInprogress is cleared even if finalization fails. I think finalizeLogSegment should call ci.read() and only call clear if the inprogress znode it is finalizing matches.
          2. clear() should use versionNumberForPermission in its setData
          3. read() should check the layout version and fail if it's greater than the current layout version. Also, I think CurrentInprogress should have its own layout version, rather than using the BKJM one (a rough sketch follows).
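
          A rough sketch of the layout-version check point 3 asks of read(), assuming the znode data is laid out as "<layoutVersion>;<hostname>;<path>" as suggested earlier in the thread; the field order, delimiter and version constant are assumptions, not the committed format:

          import java.io.IOException;

          public class CurrentInprogressReadSketch {
            // Hypothetical layout version owned by CurrentInprogress itself,
            // independent of the BKJM layout version.
            static final int CURRENT_INPROGRESS_LAYOUT_VERSION = 1;

            /** Parse the znode data and return the inprogress path, or null if empty. */
            static String parseCurrent(byte[] data) throws IOException {
              if (data == null || data.length == 0) {
                return null;
              }
              String[] fields = new String(data).split(";", 3); // layoutVersion;hostname;path
              int layoutVersion = Integer.parseInt(fields[0]);
              if (layoutVersion > CURRENT_INPROGRESS_LAYOUT_VERSION) {
                // Written by a newer, incompatible version of the software.
                throw new IOException("Unsupported CurrentInprogress layout version " + layoutVersion);
              }
              return fields.length == 3 ? fields[2] : null;
            }
          }
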
          Uma Maheswara Rao G added a comment -

          Hi Ivan, Thanks a lot for taking a look.

          Just a clarification: is this the change you are suggesting for #1?
          Before clearing, we should have the check because, at the same time, another node may come in and create its own node; this current node would then clear the inprogress path and change the version number, so the other node may fail later. Is this the point you are making here?

                 maxTxId.store(lastTxId);
                 zkc.delete(inprogressPath, inprogressStat.getVersion());
          +      String inprogressPathFromCI = ci.read();
          +      if (inprogressPathFromCI.equals(inprogressPath)) {
          +        ci.clear();
          +      }
               } catch (KeeperException e) {
                 throw new IOException("Error finalising ledger", e);
               } catch (InterruptedException ie) {
                 throw new IOException("Error finalising ledger", ie);
          -    } finally {
          -      wl.release();
          -    }
          +    } 
          

          For #3, yes, I agree we can have the version check. I also thought about having a separate version number, but I felt it may not be good to maintain version numbers for each znode; we could update a single version number if anything changes for any znode. Of course, version checks can still be maintained, as in the NN for all edit log ops.
          Anyway, I will create a separate version number for the CurrentInprogress node and will add the version check.

          Could you please clarify #1 above?

          Ivan Kelly added a comment -

          For #1, what you have is good.

          Uma Maheswara Rao G added a comment -

          Updated the patch addressing Ivan's comments.

          Hadoop QA added a comment -

          +1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12530052/HDFS-3452-2.patch
          against trunk revision .

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 3 new or modified test files.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 eclipse:eclipse. The patch built with eclipse:eclipse.

          +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          +1 core tests. The patch passed unit tests in hadoop-hdfs-project/hadoop-hdfs/src/contrib/bkjournal.

          +1 contrib tests. The patch passed contrib unit tests.

          Test results: https://builds.apache.org/job/PreCommit-HDFS-Build/2532//testReport/
          Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/2532//console

          This message is automatically generated.

          Ivan Kelly added a comment -

          lgtm +1

          Uma Maheswara Rao G added a comment -

          I have just committed this to trunk and branch-2.
          Thanks a lot, Ivan for the reviews!

          Hudson added a comment -

          Integrated in Hadoop-Hdfs-trunk-Commit #2370 (See https://builds.apache.org/job/Hadoop-Hdfs-trunk-Commit/2370/)
          HDFS-3452. BKJM:Switch from standby to active fails and NN gets shut down due to delay in clearing of lock. Contributed by Uma Maheswara Rao G. (Revision 1343913)

          Result = SUCCESS
          umamahesh : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1343913
          Files :

          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/contrib/bkjournal/src/main/java/org/apache/hadoop/contrib/bkjournal/BookKeeperEditLogOutputStream.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/contrib/bkjournal/src/main/java/org/apache/hadoop/contrib/bkjournal/BookKeeperJournalManager.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/contrib/bkjournal/src/main/java/org/apache/hadoop/contrib/bkjournal/CurrentInprogress.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/contrib/bkjournal/src/main/java/org/apache/hadoop/contrib/bkjournal/WriteLock.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/contrib/bkjournal/src/test/java/org/apache/hadoop/contrib/bkjournal/TestBookKeeperAsHASharedDir.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/contrib/bkjournal/src/test/java/org/apache/hadoop/contrib/bkjournal/TestBookKeeperJournalManager.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/contrib/bkjournal/src/test/java/org/apache/hadoop/contrib/bkjournal/TestCurrentInprogress.java
          Hudson added a comment -

          Integrated in Hadoop-Common-trunk-Commit #2297 (See https://builds.apache.org/job/Hadoop-Common-trunk-Commit/2297/)
          HDFS-3452. BKJM:Switch from standby to active fails and NN gets shut down due to delay in clearing of lock. Contributed by Uma Maheswara Rao G. (Revision 1343913)

          Result = SUCCESS
          umamahesh : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1343913
          Files :

          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/contrib/bkjournal/src/main/java/org/apache/hadoop/contrib/bkjournal/BookKeeperEditLogOutputStream.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/contrib/bkjournal/src/main/java/org/apache/hadoop/contrib/bkjournal/BookKeeperJournalManager.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/contrib/bkjournal/src/main/java/org/apache/hadoop/contrib/bkjournal/CurrentInprogress.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/contrib/bkjournal/src/main/java/org/apache/hadoop/contrib/bkjournal/WriteLock.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/contrib/bkjournal/src/test/java/org/apache/hadoop/contrib/bkjournal/TestBookKeeperAsHASharedDir.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/contrib/bkjournal/src/test/java/org/apache/hadoop/contrib/bkjournal/TestBookKeeperJournalManager.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/contrib/bkjournal/src/test/java/org/apache/hadoop/contrib/bkjournal/TestCurrentInprogress.java
          Hudson added a comment -

          Integrated in Hadoop-Mapreduce-trunk-Commit #2316 (See https://builds.apache.org/job/Hadoop-Mapreduce-trunk-Commit/2316/)
          HDFS-3452. BKJM:Switch from standby to active fails and NN gets shut down due to delay in clearing of lock. Contributed by Uma Maheswara Rao G. (Revision 1343913)

          Result = FAILURE
          umamahesh : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1343913
          Files :

          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/contrib/bkjournal/src/main/java/org/apache/hadoop/contrib/bkjournal/BookKeeperEditLogOutputStream.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/contrib/bkjournal/src/main/java/org/apache/hadoop/contrib/bkjournal/BookKeeperJournalManager.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/contrib/bkjournal/src/main/java/org/apache/hadoop/contrib/bkjournal/CurrentInprogress.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/contrib/bkjournal/src/main/java/org/apache/hadoop/contrib/bkjournal/WriteLock.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/contrib/bkjournal/src/test/java/org/apache/hadoop/contrib/bkjournal/TestBookKeeperAsHASharedDir.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/contrib/bkjournal/src/test/java/org/apache/hadoop/contrib/bkjournal/TestBookKeeperJournalManager.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/contrib/bkjournal/src/test/java/org/apache/hadoop/contrib/bkjournal/TestCurrentInprogress.java
          Hudson added a comment -

          Integrated in Hadoop-Hdfs-trunk #1061 (See https://builds.apache.org/job/Hadoop-Hdfs-trunk/1061/)
          HDFS-3452. BKJM:Switch from standby to active fails and NN gets shut down due to delay in clearing of lock. Contributed by Uma Maheswara Rao G. (Revision 1343913)

          Result = SUCCESS
          umamahesh : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1343913
          Files :

          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/contrib/bkjournal/src/main/java/org/apache/hadoop/contrib/bkjournal/BookKeeperEditLogOutputStream.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/contrib/bkjournal/src/main/java/org/apache/hadoop/contrib/bkjournal/BookKeeperJournalManager.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/contrib/bkjournal/src/main/java/org/apache/hadoop/contrib/bkjournal/CurrentInprogress.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/contrib/bkjournal/src/main/java/org/apache/hadoop/contrib/bkjournal/WriteLock.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/contrib/bkjournal/src/test/java/org/apache/hadoop/contrib/bkjournal/TestBookKeeperAsHASharedDir.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/contrib/bkjournal/src/test/java/org/apache/hadoop/contrib/bkjournal/TestBookKeeperJournalManager.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/contrib/bkjournal/src/test/java/org/apache/hadoop/contrib/bkjournal/TestCurrentInprogress.java
          Hudson added a comment -

          Integrated in Hadoop-Mapreduce-trunk #1095 (See https://builds.apache.org/job/Hadoop-Mapreduce-trunk/1095/)
          HDFS-3452. BKJM:Switch from standby to active fails and NN gets shut down due to delay in clearing of lock. Contributed by Uma Maheswara Rao G. (Revision 1343913)

          Result = SUCCESS
          umamahesh : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1343913
          Files :

          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/contrib/bkjournal/src/main/java/org/apache/hadoop/contrib/bkjournal/BookKeeperEditLogOutputStream.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/contrib/bkjournal/src/main/java/org/apache/hadoop/contrib/bkjournal/BookKeeperJournalManager.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/contrib/bkjournal/src/main/java/org/apache/hadoop/contrib/bkjournal/CurrentInprogress.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/contrib/bkjournal/src/main/java/org/apache/hadoop/contrib/bkjournal/WriteLock.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/contrib/bkjournal/src/test/java/org/apache/hadoop/contrib/bkjournal/TestBookKeeperAsHASharedDir.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/contrib/bkjournal/src/test/java/org/apache/hadoop/contrib/bkjournal/TestBookKeeperJournalManager.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/contrib/bkjournal/src/test/java/org/apache/hadoop/contrib/bkjournal/TestCurrentInprogress.java

            People

            • Assignee:
              Uma Maheswara Rao G
            • Reporter:
              suja s
            • Votes:
              0
            • Watchers:
              11
