Hadoop HDFS / HDFS-3731

2.0 release upgrade must handle blocks being written from 1.0

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Blocker
    • Resolution: Fixed
    • Affects Version/s: 2.0.0-alpha
    • Fix Version/s: 0.23.4, 2.0.2-alpha
    • Component/s: datanode
    • Labels: None

      Description

      Release 2.0 upgrades must handle blocks being written to (bbw) files from 1.0 release. Problem reported by @brahmareddy
      The DataNode will only have one block pool after upgrading from a 1.x release. (This is because in the 1.x releases, there were no block pools-- or equivalently, everything was in the same block pool). During the upgrade, we should hardlink the block files from the blocksBeingWritten directory into the rbw directory of this block pool. Similarly, on -finalize, we should delete the blocksBeingWritten directory.
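
      For illustration, a minimal sketch of the hardlink-on-upgrade step described above, assuming an invented class and method name (the actual patch works through the DataNode's internal DataStorage/HardLink machinery):

        import java.io.File;
        import java.io.IOException;
        import java.nio.file.Files;

        // Hypothetical sketch of the bbw -> rbw upgrade step.
        public class BbwUpgradeSketch {
          static void linkBbwIntoRbw(File bbwDir, File rbwDir) throws IOException {
            File[] blocks = bbwDir.listFiles();
            if (blocks == null) {
              return; // no blocksBeingWritten directory: nothing to do
            }
            if (!rbwDir.isDirectory() && !rbwDir.mkdirs()) {
              throw new IOException("could not create " + rbwDir);
            }
            for (File block : blocks) {
              // Hard link rather than copy, so the upgrade is fast and a
              // rollback still finds the original files untouched.
              Files.createLink(new File(rbwDir, block.getName()).toPath(),
                               block.toPath());
            }
          }
        }

      On -finalize, the blocksBeingWritten directory would then simply be deleted.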

      Attachments

      1. hdfs-3731.branch-023.patch.txt
        23 kB
        Kihwal Lee
      2. hadoop1-bbw.tgz
        39 kB
        Colin Patrick McCabe
      3. HDFS-3731.003.patch
        21 kB
        Colin Patrick McCabe
      4. HDFS-3731.002.patch
        21 kB
        Colin Patrick McCabe

        Issue Links

          is related to HADOOP-8291
          relates to HBASE-6586
          is blocked by HDFS-3853

          Activity

          Suresh Srinivas created issue -
          Suresh Srinivas added a comment -

          In the interim, is there a manual procedure for closing the log files in HBase, such as gracefully shutting it down?

          Suresh Srinivas made changes -
          Target Version/s 2.1.0-alpha [ 12321440 ]
          Robert Joseph Evans added a comment -

          I am a bit confused by this, as I am not an expert on HDFS. I am mainly concerned with whether this impacts 0.23 (I assume it does), and if so, what that impact is. Does it mean that the datanode could drop the last block of a file because that block is in a bbw file when the datanode is upgraded? You mention HBase here; does this only impact a block that is being written with hsync?

          Suresh Srinivas added a comment -

          This is a problem for all releases starting from 0.21. It applies only to installs that used hflush semantics in 0.20.x-based releases:

          1. 0.20.203 or later releases with append turned on (includes 1.0.x releases)
          2. 0.20.2 + append branch releases with append turned on

          Installations without append turned on are not affected, since blocksBeingWritten is not used.

          Robert Joseph Evans added a comment -

          Thanks for the clarification, Suresh. If it is simple to put this into 0.23, I would really appreciate it. If not, I can do the porting myself when the time comes.

          Todd Lipcon made changes -
          Assignee Todd Lipcon [ tlipcon ]
          Uma Maheswara Rao G added a comment -

          Yep, as Suresh mentioned, we have broken rollback compatibility when append is enabled.
          Do we need to deliver a patch to older versions for handling rollbacks? Clusters that want to roll back may need to apply it before proceeding (mainly to handle the partial block folders (BBW)).
          Todd, what is your idea for handling this?

          Raju added a comment -

          In the upgrade process we usually rename the old folder to the previous folder, but here we have two folders to handle (one containing finalized blocks, the other BBW). I would like to propose:

          sd.root/previous >>> sd.root/current/BPID/current/finalized
          sd.root/blocksbeingwritten >>> sd.root/current/BPID/current/rbw

          NOTE: >>> hardlink
          NOTE: blocksbeingwritten folder is not renamed since this can cause rollback problem with old versions.

          With the above fix there are some advantages:
          1. Simple change in the existing (2.x+) code,
          2. No change required in the old (1.x) code,
          3. Rollback will work without any extra effort.

          Please correct me if I am wrong.

          Todd Lipcon added a comment -

          Hi Raju. The above sounds right to me.

          We should probably hook the "finalize" code to also rm -rf the "blocksbeingwritten" directory, or else the storage will be leaked forever, right?

          Uma Maheswara Rao G added a comment -

          sd.root/previous >>> sd.root/current/BPID/current/finalized
          sd.root/blocksbeingwritten >>> sd.root/current/BPID/current/rbw

          Sounds good to me for the Hadoop-2 upgrade. Here we have wrapped all the folders inside current, which gives us the advantage of not changing any code in the older versions.

          Another point I considered:
          If I am running 0.20.203 and want to upgrade to hadoop-1, we will still have the problem, right? Because we handle blocksBeingWritten as a folder separate from current in the 0.20.x versions.
          If you agree, maybe we can file a separate JIRA for that, as this JIRA is mainly about the 2.0 upgrade from 1.0.

          Robert Joseph Evans added a comment -

          I thought that hardlinks to directories are not typically supported. HFS+ on the Mac is the only filesystem I know of that allows them. I am nervous about implementing an upgrade path that will only work on a Mac. Did you actually mean a symbolic link, or did you intend to hardlink all of the files in the directories?

          Raju added a comment -

          We should probably hook the "finalize" code to also rm -rf the "blocksbeingwritten" directory, or else the storage will be leaked forever, right?

          Yes Todd, I forgot to mention finalize; we need to delete the BBW dir in finalize.

          If you agree, maybe we can file a separate JIRA for that, as this JIRA is mainly about the 2.0 upgrade from 1.0.

          Uma, I take your point, but that means modifying code which has already been released; I am not very clear on how to fix that.

          I thought that hardlinks to directories are not typically supported ...

          Robert, here I am not referring to a direct hardlink like

          ln sourceDir destDir
          

          I am talking about using

          void org.apache.hadoop.hdfs.server.datanode.DataStorage.linkBlocks(File from, File to, int oldLV, HardLink hl)
          

          which will do the individual file linking for all the blocks.
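
          As a rough sketch (the surrounding variables are assumptions, and linkBlocks is an internal DataNode helper, so this is illustration only, not the actual patch):

          File bbwDir = new File(storageRoot, "blocksBeingWritten");
          File rbwDir = new File(bpCurrentDir, "rbw");
          if (bbwDir.exists()) {
            // Hardlink each block file under blocksBeingWritten into the
            // block pool's rbw directory using the existing helper.
            DataStorage.linkBlocks(bbwDir, rbwDir, oldLayoutVersion, new HardLink());
          }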

          Colin Patrick McCabe made changes -
          Assignee Todd Lipcon [ tlipcon ] → Colin Patrick McCabe [ cmccabe ]
          Robert Joseph Evans added a comment -

          Is there any update on this? There has been no activity for about a week, and this seems fairly critical to fix.

          Colin Patrick McCabe added a comment -

          I have a patch that does what was suggested (hardlinking blocks from blocksBeingWritten into rbw on -upgrade). However, it doesn't seem to quite work. Whenever I try to open any of the files that have been handled in this way, I get "java.io.IOException: Cannot obtain block length for LocatedBlock."

          I have confirmed that the DN does know about all the blocks I put into rbw. The log message says "BlockReport of 13 blocks took 1 msec to generate," etc.

          Colin Patrick McCabe added a comment -

          Here is the patch. It's a work in progress-- the unit test does not pass yet. As I said before, the DN does definitely pick up on the blocks, but when I try to open one of the files, I am getting this error:

          java.io.IOException: Cannot obtain block length for LocatedBlock{BP-1760620400-127.0.0.1-1344450370534:blk_-4365335954576768874_1019; getBlockSize()=65536; corrupt=false; offset=0; locs=[127.0.0.1:57476]}
          

          It's possible that this is a lease recovery issue of some kind...

          Colin Patrick McCabe made changes -
          Attachment HDFS-3731.002.patch [ 12539902 ]
          Colin Patrick McCabe added a comment -
          • sample NN / DN tgz used by unit test (I also put this in the patch in git binary diff format)
          Colin Patrick McCabe made changes -
          Attachment hadoop-10-bbw.tgz [ 12539903 ]
          Colin Patrick McCabe made changes -
          Attachment hadoop-10-bbw.tgz [ 12539903 ]
          Colin Patrick McCabe added a comment -

          This patch fixes the unit test. Specifically, it triggers lease recovery on the rbw files, so that we can read them and checksum them properly.

          We also retry when we try to open a file and get "Cannot obtain block length for LocatedBlock" (lease recovery is an asynchronous process, so we can't be sure exactly when it will complete.)
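
          A minimal sketch of that retry, with made-up names and timings (the real test code differs):

          import java.io.IOException;
          import org.apache.hadoop.fs.FSDataInputStream;
          import org.apache.hadoop.fs.FileSystem;
          import org.apache.hadoop.fs.Path;

          // Illustrative retry: lease recovery is asynchronous, so open()
          // right after the upgrade may fail until the block length is known.
          public class OpenRetrySketch {
            static FSDataInputStream openWithRetry(FileSystem fs, Path path)
                throws IOException, InterruptedException {
              final int maxAttempts = 10;  // assumption, not the real test's value
              IOException last = null;
              for (int i = 0; i < maxAttempts; i++) {
                try {
                  return fs.open(path);
                } catch (IOException e) {
                  String msg = e.getMessage();
                  if (msg == null || !msg.contains("Cannot obtain block length")) {
                    throw e;  // unrelated failure: fail fast
                  }
                  last = e;
                  Thread.sleep(1000);  // give lease recovery time to finish
                }
              }
              throw last;
            }
          }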

          Colin Patrick McCabe made changes -
          Attachment HDFS-3731.003.patch [ 12539987 ]
          Colin Patrick McCabe made changes -
          Status Open [ 1 ] → Patch Available [ 10002 ]
          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12539987/HDFS-3731.003.patch
          against trunk revision .

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 1 new or modified test files.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 eclipse:eclipse. The patch built with eclipse:eclipse.

          +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          -1 core tests. The patch failed these unit tests in hadoop-hdfs-project/hadoop-hdfs:

          org.apache.hadoop.hdfs.TestDatanodeBlockScanner
          org.apache.hadoop.hdfs.TestDFSUpgradeFromImage
          org.apache.hadoop.hdfs.TestFileConcurrentReader

          +1 contrib tests. The patch passed contrib unit tests.

          Test results: https://builds.apache.org/job/PreCommit-HDFS-Build/2981//testReport/
          Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/2981//console

          This message is automatically generated.

          Colin Patrick McCabe added a comment -

          Unfortunately, our Jenkins infrastructure doesn't yet have support for binary patches (see HADOOP-8291). So TestDFSUpgradeFromImage fails here because Jenkins failed to create the hadoop1-bbw.tgz file that was needed by the test.

          The other two test failures, TestDatanodeBlockScanner and TestFileConcurrentReader, are unrelated to the patch.

          Colin Patrick McCabe made changes -
          Link This issue is related to HADOOP-8291 [ HADOOP-8291 ]
          Robert Joseph Evans added a comment -

          I am not an HDFS expert but the patch looks good to me. +1 non-binding.

          Suresh Srinivas added a comment -

          Colin, can you please add a description of the final approach you are taking to solve this problem?

          Colin Patrick McCabe made changes -
          Description Release 2.0 upgrades must handle blocks being written to (bbw) files from 1.0 release. Problem reported by Brahma Reddy.
          → Release 2.0 upgrades must handle blocks being written to (bbw) files from 1.0 release. Problem reported by Brahma Reddy.

          The {{DataNode}} will only have one block pool after upgrading from a 1.x release. (This is because in the 1.x releases, there were no block pools-- or equivalently, everything was in the same block pool). During the upgrade, we should hardlink the block files from the {{blocksBeingWritten}} directory into the {{rbw}} directory of this block pool. Similarly, on {{-finalize}}, we should delete the {{blocksBeingWritten}} directory.
          Colin Patrick McCabe added a comment -

          I added a description of the approach to the Description field.

          Suresh Srinivas added a comment -

          Colin, is the recovery mechanism for bbw blocks in 1.x and rbw in 2.x compatible?

          Colin Patrick McCabe added a comment -

          The design doc for HDFS-265 says:

          RWR (Replica Waiting to be Recovered): If a DataNode dies and restarts, all its rbw replicas change to be in the rwr state. Rwr replicas will not be in any pipeline and therefore will not receive any new bytes. They will either become out of date or will participate in a lease recovery if the client also dies.

          It seems to me that by putting the blocks into the rbw directory, what will happen when the 2.x DataNode is started is that the blocks will participate in lease recovery after a few minutes have gone past.

          There is a unit test in this patch which short-cuts this process by manually invoking lease recovery on the files and then verifying that they can be read.

          Is there any more documentation about the lease recovery process? As far as I can tell, it seems to work fine on the files in this patch. It might be useful to test waiting for automatic lease recovery to be triggered rather than invoking it manually.
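
          For reference, a sketch of triggering lease recovery manually, roughly what the unit test's short-cut amounts to (DistributedFileSystem.recoverLease is a public API; the polling loop and sleep interval are assumptions):

          import org.apache.hadoop.fs.Path;
          import org.apache.hadoop.hdfs.DistributedFileSystem;

          // Illustrative: ask the NameNode to start lease recovery on a file,
          // then poll until the file is closed and can be read normally.
          public class LeaseRecoverySketch {
            static void recoverLeaseBlocking(DistributedFileSystem dfs, Path path)
                throws Exception {
              // recoverLease returns true once the file has been closed.
              while (!dfs.recoverLease(path)) {
                Thread.sleep(1000);  // recovery completes asynchronously
              }
            }
          }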

          Jeff Hammerbacher made changes -
          Link This issue relates to HBASE-6586 [ HBASE-6586 ]
          Colin Patrick McCabe added a comment -

          I ran another experiment where I turned down the hard lease recovery time by a little bit, and then upgraded a cluster with non-empty blocksBeingWritten directory from 1.x to 2.x using this patch.

          I confirmed that LeaseManager#Monitor tracked the files from 1.x that I had left unrecovered. They were eventually recovered after a period of time, as you can see from these logs:

          2012-08-15 17:25:37,029 INFO  hdfs.StateChange (BlockInfoUnderConstruction.java:initializeBlockRecovery(248)) - BLOCK* blk_4712799930147027732_1021{blockUCState=UNDER_RECOVERY, primaryNodeIndex=0, replicas=[ReplicaUnderConstruction[127.0.0.1:48054|RWR]]} recovery started, primary=127.0.0.1:48054
          2012-08-15 17:25:37,029 WARN  hdfs.StateChange (FSNamesystem.java:internalReleaseLease(3061)) - DIR* NameSystem.internalReleaseLease: File /top-dir-1Mb-512/directory1/file-with-no-crc has not been closed. Lease recovery is in progress. RecoveryId = 1043 for block blk_4712799930147027732_1021{blockUCState=UNDER_RECOVERY, primaryNodeIndex=0, replicas=[ReplicaUnderConstruction[127.0.0.1:48054|RWR]]}
          2012-08-15 17:25:37,030 INFO  namenode.LeaseManager (LeaseManager.java:checkLeases(444)) - Started block recovery for file /top-dir-1Mb-512/directory1/file-with-no-crc lease [Lease.  Holder: DFSClient_8256078, pendingcreates: 0]
          2012-08-15 17:25:37,625 INFO  datanode.DataNode (DataNode.java:logRecoverBlock(2016)) - NameNode at keter/127.0.0.1:50472 calls recoverBlock(block=BP-2033981131-127.0.0.1-1345076136282:blk_7162739548153522810_1020, targets=[127.0.0.1:48054], newGenerationStamp=1031)
          2012-08-15 17:25:37,626 INFO  impl.FsDatasetImpl (FsDatasetImpl.java:initReplicaRecovery(1416)) - initReplicaRecovery: block=blk_7162739548153522810_1020, recoveryId=1031, replica=ReplicaWaitingToBeRecovered, blk_7162739548153522810_1020, RWR
            getNumBytes()     = 1024
            getBytesOnDisk()  = 1024
            getVisibleLength()= -1
            getVolume()       = /home/cmccabe/hadoop1/hadoop-hdfs-project/hadoop-hdfs/target/test/data/dfs/data/data1/current
            getBlockFile()    = /home/cmccabe/hadoop1/hadoop-hdfs-project/hadoop-hdfs/target/test/data/dfs/data/data1/current/BP-2033981131-127.0.0.1-1345076136282/current/rbw/blk_7162739548153522810
            unlinked=false
          2012-08-15 17:25:37,626 INFO  impl.FsDatasetImpl (FsDatasetImpl.java:initReplicaRecovery(1471)) - initReplicaRecovery: changing replica state for blk_7162739548153522810_1020 from RWR to RUR
          2012-08-15 17:25:37,627 INFO  impl.FsDatasetImpl (FsDatasetImpl.java:updateReplicaUnderRecovery(1486)) - updateReplica: block=BP-2033981131-127.0.0.1-1345076136282:blk_7162739548153522810_1020, recoveryId=1031, length=1024, replica=ReplicaUnderRecovery, blk_7162739548153522810_1020, RUR
            getNumBytes()     = 1024
            getBytesOnDisk()  = 1024
            getVisibleLength()= -1
            getVolume()       = /home/cmccabe/hadoop1/hadoop-hdfs-project/hadoop-hdfs/target/test/data/dfs/data/data1/current
            getBlockFile()    = /home/cmccabe/hadoop1/hadoop-hdfs-project/hadoop-hdfs/target/test/data/dfs/data/data1/current/BP-2033981131-127.0.0.1-1345076136282/current/rbw/blk_7162739548153522810
            recoveryId=1031
            original=ReplicaWaitingToBeRecovered, blk_7162739548153522810_1020, RWR
            getNumBytes()     = 1024
            getBytesOnDisk()  = 1024
            getVisibleLength()= -1
            getVolume()       = /home/cmccabe/hadoop1/hadoop-hdfs-project/hadoop-hdfs/target/test/data/dfs/data/data1/current
            getBlockFile()    = /home/cmccabe/hadoop1/hadoop-hdfs-project/hadoop-hdfs/target/test/data/dfs/data/data1/current/BP-2033981131-127.0.0.1-1345076136282/current/rbw/blk_7162739548153522810
          2012-08-15 17:25:37,629 INFO  namenode.FSNamesystem (FSNamesystem.java:commitBlockSynchronization(3140)) - commitBlockSynchronization(lastblock=BP-2033981131-127.0.0.1-1345076136282:blk_7162739548153522810_1020, newgenerationstamp=1031, newlength=1024, newtargets=[127.0.0.1:48054], closeFile=true, deleteBlock=false)
          2012-08-15 17:25:37,631 INFO  namenode.FSNamesystem (FSNamesystem.java:commitBlockSynchronization(3217)) - commitBlockSynchronization(newblock=BP-2033981131-127.0.0.1-1345076136282:blk_7162739548153522810_1020, file=/1kb-multiple-checksum-blocks-64-16, newgenerationstamp=1031, newlength=1024, newtargets=[127.0.0.1:48054]) successful
          

          So as far as I can see:
          1. after the upgrade, the fsimage contains leases which are loaded into memory in the upgraded NN
          2. the leases are recovered after a given amount of time by moving the blocks files out of rbw/ and into finalized/

          So basically, as far as I can see, the recovery mechanism for bbw blocks in 1.x and rbw in 2.x is compatible. The LeaseManager code also seems fairly similar. However, I welcome additional commentary on this, since I am not greatly familiar with this area of the code. Is there anything else we should verify here?

          I also feel like a nice enhancement might be immediately recovering all leases on an upgrade from branch-1. It doesn't make sense to keep leases around unless you think a client can "come back"-- however, branch-1 clients cannot interact with branch-2 NameNodes. (correct me if I'm wrong here?) However, I'm not sure if it's worth investing the time and adding the code complexity to do that. This is already something of a corner case. We may also want to have upgrades in the future where old clients continue to function.

          Suresh Srinivas added a comment -

          Colin, thanks for the detailed comment. My concern is related to the durability guarantees. Rbw was added specifically to support hsync, and the implementation has changed from 1.x to 2.x. I want to make sure we have both sync- and lease-recovery-related testing, to ensure hsync'ed files are preserved during upgrade.

          I am currently too busy to really look at the code and verify its correctness. Others who know this area quite well are Todd and Nicholas; I am not sure if they have time. I can get back to this maybe after a week.

          Todd Lipcon added a comment -

          I'm on vacation for the next week or so (just doing a quick email check). But I can't think of any reason why the new recovery protocol wouldn't work properly with Colin's solution. So I'm +1 on the design (but haven't looked at the patch)

          Eli Collins added a comment -

          Colin, the approach and your patch look good to me. What's the latest on testing? E.g., have you verified that a v1 install with hsync'ed files correctly upgrades (e.g., leases recovered) to a v2 or trunk build with this patch?

          Nicholas, do you have cycles to take a look? I spent a while paging in the relevant code, and the patch/testing look good to me, but another set of eyes would be good. The patch itself is straightforward, so the risk is mostly in the subtleties.

          Colin Patrick McCabe added a comment -

          Colin, the approach and your patch look good to me. What's the latest on testing? E.g., have you verified that a v1 install with hsync'ed files correctly upgrades (e.g., leases recovered) to a v2 or trunk build with this patch?

          I installed branch-1 and started a one-node cluster. I have a small test program which creates a file, calls hsync on it, and then sleeps for a few minutes. During its sleep, I brought down the cluster. That left me with a block in blocksBeingWritten corresponding to the last block of the file that was hsync'ed. (I guess it's called sync in branch-1.)

          Then I installed a hacked version of trunk with this patch applied. The hack was to decrease the hard lease recovery time, which is normally hard-coded to an hour; I didn't want to wait that long.

          I ran ./bin/start-dfs.sh -upgrade. I confirmed that the last block of the file was recovered, moved into rbw, then into finalized/, and could then be read.

          Then I stopped the cluster, ran ./bin/namenode -finalize, answered 'Y', and restarted the cluster. The blocksBeingWritten directory was gone, as well as the previous directory.

          The unit test does basically the same thing, in a more systematic way. Hope this helps.
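
          A sketch of that kind of test program (the path and sizes are invented; on branch-1 the call is sync() rather than hsync()):

          import org.apache.hadoop.conf.Configuration;
          import org.apache.hadoop.fs.FSDataOutputStream;
          import org.apache.hadoop.fs.FileSystem;
          import org.apache.hadoop.fs.Path;

          // Illustrative: write data and hsync it so the DataNode holds a
          // blocksBeingWritten replica, then sleep so the cluster can be
          // shut down while the file is still open.
          public class HsyncHolderSketch {
            public static void main(String[] args) throws Exception {
              FileSystem fs = FileSystem.get(new Configuration());
              FSDataOutputStream out = fs.create(new Path("/bbw-test-file"));
              out.write(new byte[1024]);
              out.hsync();                  // sync() on branch-1
              Thread.sleep(5 * 60 * 1000);  // shut the cluster down during this
              out.close();
            }
          }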

          Eli Collins added a comment -

          Great, +1. The code/design makes sense to me, we know the current path is broken, and we've verified the fix. Suresh and Nicholas, I'll commit this unless you're going to check it out.

          Colin Patrick McCabe added a comment -

          Uploading hadoop1-bbw.tgz separately, since git binary diffs don't work with all versions of patch.

          Colin Patrick McCabe made changes -
          Attachment hadoop1-bbw.tgz [ 12542345 ]
          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12542345/hadoop1-bbw.tgz
          against trunk revision .

          -1 patch. The patch command could not apply the patch.

          Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/3093//console

          This message is automatically generated.

          Hudson added a comment -

          Integrated in Hadoop-Hdfs-trunk-Commit #2701 (See https://builds.apache.org/job/Hadoop-Hdfs-trunk-Commit/2701/)
          HDFS-3731. 2.0 release upgrade must handle blocks being written from 1.0. Contributed by Colin Patrick McCabe (Revision 1377137)

          Result = SUCCESS
          eli : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1377137
          Files :

          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/common/Storage.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/datanode/DataStorage.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/TestDFSUpgradeFromImage.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/test/resources/hadoop1-bbw.tgz
          Hudson added a comment -

          Integrated in Hadoop-Common-trunk-Commit #2637 (See https://builds.apache.org/job/Hadoop-Common-trunk-Commit/2637/)
          HDFS-3731. 2.0 release upgrade must handle blocks being written from 1.0. Contributed by Colin Patrick McCabe (Revision 1377137)

          Result = SUCCESS
          eli : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1377137
          Files :

          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/common/Storage.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/datanode/DataStorage.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/TestDFSUpgradeFromImage.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/test/resources/hadoop1-bbw.tgz
          Eli Collins added a comment -

          I've committed this to trunk, merging to branch-2 now.

          Bobby, would you like me to merge to branch-0.23?

          Jason Lowe added a comment -

          Bobby just stepped out, but yes, we are very interested in having this in 0.23. It's one of the things we've been waiting for before officially wrapping up 0.23.3.

          Eli Collins added a comment -

          Will do. Colin needs to merge the enableManagedDfsDirsRedundancy part of the MiniDFSCluster change from HDFS-3049 to branch-2 and branch-0.23; JIRA for that coming.

          Eli Collins made changes -
          Target Version/s 2.1.0-alpha [ 12321440 ] → 0.23.3, 2.2.0-alpha [ 12320052, 12322472 ]
          Hudson added a comment -

          Integrated in Hadoop-Mapreduce-trunk-Commit #2665 (See https://builds.apache.org/job/Hadoop-Mapreduce-trunk-Commit/2665/)
          HDFS-3731. 2.0 release upgrade must handle blocks being written from 1.0. Contributed by Colin Patrick McCabe (Revision 1377137)

          Result = FAILURE
          eli : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1377137
          Files :

          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/common/Storage.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/datanode/DataStorage.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/TestDFSUpgradeFromImage.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/test/resources/hadoop1-bbw.tgz
          Colin Patrick McCabe made changes -
          Link This issue is blocked by HDFS-3853 [ HDFS-3853 ]
          Colin Patrick McCabe added a comment -

          We need HDFS-3853 in order to port this to branch-2.

          Hudson added a comment -

          Integrated in Hadoop-Hdfs-trunk #1145 (See https://builds.apache.org/job/Hadoop-Hdfs-trunk/1145/)
          HDFS-3731. 2.0 release upgrade must handle blocks being written from 1.0. Contributed by Colin Patrick McCabe (Revision 1377137)

          Result = FAILURE
          eli : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1377137
          Files :

          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/common/Storage.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/datanode/DataStorage.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/TestDFSUpgradeFromImage.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/test/resources/hadoop1-bbw.tgz
          Hudson added a comment -

          Integrated in Hadoop-Mapreduce-trunk #1176 (See https://builds.apache.org/job/Hadoop-Mapreduce-trunk/1176/)
          HDFS-3731. 2.0 release upgrade must handle blocks being written from 1.0. Contributed by Colin Patrick McCabe (Revision 1377137)

          Result = SUCCESS
          eli : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1377137
          Files :

          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/common/Storage.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/datanode/DataStorage.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/TestDFSUpgradeFromImage.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/test/resources/hadoop1-bbw.tgz
          Eli Collins added a comment -

          Merged to branch-2. Ran into a test hang in the branch-0.23 patch, investigating.

          Eli Collins made changes -
          Fix Version/s 2.2.0-alpha [ 12322472 ]
          Target Version/s 0.23.3, 2.2.0-alpha [ 12320052, 12322472 ] → 0.23.3 [ 12320052 ]
          Robert Joseph Evans added a comment -

          Any update on branch-0.23? Do you want me to look into it?

          Colin Patrick McCabe added a comment -

          Any update on branch-0.23? Do you want me to look into it?

          There are some differences in the branch-0.23 BlockManager state machine, such that a straight port of the patch doesn't work. The easiest thing to do would probably be to backport some of the BlockManager fixes and improvements to branch-0.23. If you could look into that, it would be great.

          Robert Joseph Evans added a comment -

          Do you have a list of ones you know about? If not I can start pulling on that thread tomorrow.

          Colin Patrick McCabe added a comment -

          Do you have a list of ones you know about? If not I can start pulling on that thread tomorrow.

          Sorry, I only took a preliminary look and didn't have time to go in depth.

          The state machine errors are pretty clear in the test. You may need to wait a while for them to appear, since Surefire does a lot of buffering.

          Colin Patrick McCabe added a comment -

          Robert is looking into porting the fix to branch-0.23.

          Colin Patrick McCabe made changes -
          Assignee Colin Patrick McCabe [ cmccabe ] → Robert Joseph Evans [ revans2 ]
          Robert Joseph Evans added a comment -

          Thanks for reassigning this to me. I have been distracted by a number of other things, but I should get back to it shortly.

          Kihwal Lee made changes -
          Assignee Robert Joseph Evans [ revans2 ] → Kihwal Lee [ kihwal ]
          Kihwal Lee added a comment -

          Since Bobby is busy is busy with the 0.23.3 release, I am taking over the porting.

          The test failure happens because all RWR blocks are thrown away. When namenode sees OP_ADD on an existing file while reading edits, it turns the inode into "under construction", but does not do anything to blocks. Since it thinks there is no blocks under construction, the reported RWR blocks are considered corrupt. This only happens in 0.22 and 0.23. In 2.x and trunk, it was fixed as part of HDFS-1623 (HA). This is unacceptable behavior if append/hsync is used.

          I will file a separate jira for 0.22 and 0.23 and make it a dependency of this jira.
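
          A minimal, self-contained model of the failure mode described above may be helpful. This is not HDFS code; the class, enum, and method names are invented for illustration. It shows why a replica reported in the RWR (replica-waiting-to-be-recovered) state is flagged corrupt while edit-log replay leaves the namenode's copy of the last block COMPLETE, and accepted once replay also converts that block to under construction (the HDFS-1623 behavior):

{code:java}
// Illustrative model only -- these types do not exist in HDFS.
public class OpAddReplayModel {
  enum BlockState { COMPLETE, UNDER_CONSTRUCTION }
  enum ReplicaState { FINALIZED, RBW, RWR }

  // A reported RWR replica is only legitimate if the namenode also
  // considers the block under construction; otherwise it looks corrupt.
  static boolean reportedAsCorrupt(BlockState nnState, ReplicaState reported) {
    return reported == ReplicaState.RWR
        && nnState != BlockState.UNDER_CONSTRUCTION;
  }

  public static void main(String[] args) {
    // Buggy replay (0.22/0.23): OP_ADD marks the inode under construction
    // but leaves the last block COMPLETE, so the RWR report is rejected.
    System.out.println(reportedAsCorrupt(
        BlockState.COMPLETE, ReplicaState.RWR));             // true
    // Fixed replay (HDFS-1623): the last block is converted to
    // UNDER_CONSTRUCTION as well, so the RWR report is accepted.
    System.out.println(reportedAsCorrupt(
        BlockState.UNDER_CONSTRUCTION, ReplicaState.RWR));   // false
  }
}
{code}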

          Kihwal Lee added a comment -

          Bobby is busy is busy ....

          I meant that he's extremely busy.

          Kihwal Lee added a comment -

          Attaching the patch for branch-0.23. This is a straightforward merge of the trunk patch. However, due to differences in edit log loading in the namenode, it requires HDFS-3922.

          When committing, please copy and svn add hadoop-hdfs-project/hadoop-hdfs/src/test/resources/hadoop1-bbw.tgz

          Kihwal Lee made changes -
          Attachment hdfs-3731.branch-023.patch.txt [ 12545040 ]
          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12545040/hdfs-3731.branch-023.patch.txt
          against trunk revision .

          -1 patch. The patch command could not apply the patch.

          Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/3185//console

          This message is automatically generated.

          Daryn Sharp added a comment -

          I've committed this to branch-0.23. Thanks for the port, Kihwal.

          Daryn Sharp made changes -
          Status Patch Available [ 10002 ] Resolved [ 5 ]
          Fix Version/s 0.23.4 [ 12320242 ]
          Resolution Fixed [ 1 ]
          Hudson added a comment -

          Integrated in Hadoop-Hdfs-0.23-Build #388 (See https://builds.apache.org/job/Hadoop-Hdfs-0.23-Build/388/)
          HDFS-3731. 2.0 release upgrade must handle blocks being written from 1.0 (Kihwal Lee via daryn) (Revision 1391155)

          Result = UNSTABLE
          daryn : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1391155
          Files :

          • /hadoop/common/branches/branch-0.23/hadoop-common-project/hadoop-common/CHANGES.txt
          • /hadoop/common/branches/branch-0.23/hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt
          • /hadoop/common/branches/branch-0.23/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/BlockManager.java
          • /hadoop/common/branches/branch-0.23/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/common/Storage.java
          • /hadoop/common/branches/branch-0.23/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/datanode/DataStorage.java
          • /hadoop/common/branches/branch-0.23/hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/TestDFSUpgradeFromImage.java
          • /hadoop/common/branches/branch-0.23/hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/common/TestDistributedUpgrade.java
          • /hadoop/common/branches/branch-0.23/hadoop-hdfs-project/hadoop-hdfs/src/test/resources/hadoop1-bbw.tgz
          Arun C Murthy made changes -
          Status Resolved [ 5 ] Closed [ 6 ]
          Brahma Reddy Battula made changes -
          Description  Edited five times in succession; each edit changed only the reporter mention ("Problem reported by Brahma Reddy." → "@Brahma Reddy" → "@Brahm" → "@brahma" → "@brahmaredd" → "@brahmareddy") and left the rest of the description unchanged.
          Uma Maheswara Rao G made changes -
          Description  Two further edits to the same mention ("@brahmareddy" → "@brahmare" → "@brahmareddy"). Final text:

          Release 2.0 upgrades must handle blocks being written to (bbw) files from 1.0 release. Problem reported by @brahmareddy
          The {{DataNode}} will only have one block pool after upgrading from a 1.x release. (This is because in the 1.x releases, there were no block pools-- or equivalently, everything was in the same block pool). During the upgrade, we should hardlink the block files from the {{blocksBeingWritten}} directory into the {{rbw}} directory of this block pool. Similarly, on {{-finalize}}, we should delete the {{blocksBeingWritten}} directory.
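
          A minimal sketch of the hardlinking step described in the final text above, for readers following the mechanism. This is not the actual DataStorage upgrade code; the class name, method name, and directory layout are assumptions, and java.nio hard links stand in for the datanode's own hardlink utility:

{code:java}
// Illustrative sketch only -- not the DataStorage implementation.
import java.io.IOException;
import java.nio.file.*;

public class BbwUpgradeSketch {
  // Hard-link every block file from the 1.x blocksBeingWritten directory
  // into the single block pool's rbw directory.
  static void linkBlocksBeingWritten(Path bbwDir, Path rbwDir) throws IOException {
    Files.createDirectories(rbwDir);
    try (DirectoryStream<Path> blocks = Files.newDirectoryStream(bbwDir)) {
      for (Path block : blocks) {
        Files.createLink(rbwDir.resolve(block.getFileName()), block);
      }
    }
  }

  public static void main(String[] args) throws IOException {
    // Tiny demo against a temporary directory tree (names are invented).
    Path root = Files.createTempDirectory("bbw-demo");
    Path bbw = Files.createDirectories(root.resolve("current/blocksBeingWritten"));
    Files.write(bbw.resolve("blk_1001"), new byte[] {1, 2, 3});
    Path rbw = root.resolve("current/BP-1/current/rbw");
    linkBlocksBeingWritten(bbw, rbw);
    System.out.println(Files.exists(rbw.resolve("blk_1001")));  // true
  }
}
{code}

          Hard links keep the upgrade cheap and leave the old directory intact, so on {{-finalize}} the {{blocksBeingWritten}} directory can simply be deleted, while a rollback still finds it in place.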

            People

            • Assignee: Kihwal Lee
            • Reporter: Suresh Srinivas
            • Votes: 0
            • Watchers: 21
