Details

    • Type: New Feature
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 0.13.0
    • Component/s: None
    • Labels:
      None

      Description

      Currently the DFS cluster upgrade procedure is manual.
      http://wiki.apache.org/lucene-hadoop/Hadoop_Upgrade
      It is rather complicated and does not guarantee data recoverability in case of software errors or administrator mistakes.
      This is a description of utilities that make the upgrade process almost automatic and minimize the chance of losing or corrupting data.
      Please see the attached html file for details.

      1. DFSUpgradeProposal3.html
        53 kB
        Konstantin Shvachko
      2. FSStateTransition6.htm
        97 kB
        Konstantin Shvachko
      3. FSStateTransition7.htm
        114 kB
        Konstantin Shvachko
      4. FSStateTransitionApr03.patch
        239 kB
        Konstantin Shvachko
      5. Manual-TestCases-HdfsConversion.txt
        5 kB
        Nigel Daley
      6. TestPlan-HdfsUpgrade.html
        27 kB
        Nigel Daley
      7. TestPlan-HdfsUpgrade.html
        27 kB
        Nigel Daley
      8. TestPlan-HdfsUpgrade.html
        20 kB
        Nigel Daley

          Activity

          Doug Cutting added a comment -

          Should there also be a '-list' option, that lists all known FSSIDs?

          Also, must rollback always remove the newer version? If changes were made there they will be lost. Someone might want to rollback to revert to an old version to test something, or even to find a deleted file, then switch back to the newer version. In effect these are filesystem checkpoints. We probably don't want to encourage use of them as checkpoints right off, but we also shouldn't do things that prohibit it, like removing versions whenever we switch versions. Thoughts?

          Yoram Arnon added a comment -

          There are two things you can do with a snapshot: view individual files or roll back the entire FS.
          We currently plan to support only the latter, more extreme option, typically used only in case of disasters; that is typically non-reversible, though it would be nice to allow the former as well.

          Snapshots are immutable; otherwise you get into the business of managing diverging branches, which I'd recommend against.
          Rolling back with the option to roll forward implies your entire FS is read-only, limiting its usefulness, even for testing. The job tracker, for example, won't start on a read-only DFS, let alone execute jobs.

          I'd recommend staying the course with non-reversible rollbacks, while keeping in mind the desire for full snapshot functionality in the future, which will allow read-only viewing of individual files or directories.

          Konstantin Shvachko added a comment -

          Doug,
          Do you mean a shell -list option to list FSSIDs?
          Sure, if we displayed them in the web UI why not report the same via a shell command.

          You actually can do rollback and preserve current version.
          You need to upgrade current version to some new FSSID first then rollback to the old version.
          But I agree this is not very convenient.
          So do we want to separate taking a snapshot into its own function?
          It would still be a part of the upgrade; we are just adding an API to call it independently?

          Yoram,
          I don't think we should require snapshots to be immutable.
          The hard link scheme lets different versions coexist and be modified independently of each other.

          Yoram Arnon added a comment -

          Read-write snapshots are possible, maybe even 'neat' from a technology standpoint.
          They're just hard to manage. I know of no file systems that implement them, though they're common in revision control systems.
          I recommend against it in HDFS.

          Raghu Angadi added a comment -


          With the manual rollback on each of the nodes in the cluster, I think we will need a way to know if a data-node is connecting to the namenode with the wrong fs version, because there will be some datanodes which did not run the rollback procedure. One indirect way for the namenode to recognize such nodes is to check whether the "latest stored version" on the datanode is "later" than the namenode's. What should the Namenode do if it notices such a datanode? Since rollback is supposed to be rare, it could 'fail fast' and somehow let the admin fix such nodes.
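
          A minimal sketch of the kind of check described above, assuming illustrative names (verifyDatanodeVersion, and a stored version number that grows with each upgrade); this is not the actual HADOOP-702 code:

            import java.io.IOException;

            class VersionCheckSketch {
              // Reject a data-node whose locally stored fs version is "later" than the
              // name-node's, i.e. a node that missed the manual rollback procedure.
              static void verifyDatanodeVersion(int namenodeVersion, int datanodeStoredVersion)
                  throws IOException {
                if (datanodeStoredVersion > namenodeVersion) {
                  // Fail fast, as suggested: refuse the registration and let the admin
                  // run the rollback procedure on that node by hand.
                  throw new IOException("Data-node stored fs version " + datanodeStoredVersion
                      + " is newer than name-node version " + namenodeVersion
                      + "; this node appears to have missed the rollback.");
                }
              }
            }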

          Konstantin Shvachko added a comment -

          I substantially modified the upgrade proposal.

          • There will be more changes in the data layout.
            Please let me know if something is not satisfactory. E.g. changing file or directory
            names will be very hard with backward compatibility and up-/de-gradability in mind.
          • Taking into account massive layout changes, and also that upgrades with rollback are
            hard to support for versions with different directory structures, I propose to keep automatic
            data conversion (as we had until now) from the current version to the next one.
            No going back (rollback) and forward (upgrade) between the current and the next versions.
            The upgrades will be supported once the data is converted to the new format.
            Hope that makes sense.
          • The proposal contains more details to cover different failure scenarios.
            Like data-node started the upgrade or discard process but crashed before completing.
          • The discard and the rollback operations are not shell commands, but rather server startup
            options. We do not have administrative authorization, and this should make it harder to
            do an upgrade, discard or rollback by mistake.
          Raghu Angadi added a comment -

          The latest proposal includes <BuildVersion> in the directory name and also includes it in the "VERSION" file. I am not sure what BuildVersion would be used for. Each backed-up directory is uniquely identified by its FSSID. If we include the build version in its name as well, it gives the impression that the directory is somehow connected to the build version, but it is not. The build version will change often, and this increases the number of things to consider in the code and in error handling.

          Nigel Daley added a comment -

          A test plan for DFS upgrades. Review comments welcome.

          Konstantin Shvachko added a comment -

          The current proposal is now in DFSUpgradeProposal3.html,
          and FSStateTransition.html contains the more detailed algorithms.

          Nigel Daley added a comment -

          Updated the test plan to reflect the latest design.

          Sameer Paranjpye added a comment -

          After some discussion with Konstantin, Milind, Owen and Nigel it feels like we need some amendments to the design for upgrade and rollbacks. The most significant delta is in the area of keeping multiple snapshots with different FSSIDs.

          The fundamental problem with allowing multiple FSSIDs, each representing a different filesystem state, is that these 'snapshots' decay over time unless they are actively managed. There is no monitoring and replication of blocks in a snapshot. Datanodes going down can cause bit rot and data loss. Data corruption also goes undetected since clients never read from snapshots. Allowing multiple FSSIDs also causes the number of states the filesystem can be in to grow significantly and the number of corner cases that need to be handled to explode (particularly on the datanodes). Further, the primary motivation for this design is to protect filesystem data in the face of software upgrades and rollbacks. Snapshots were a side-effect of the design, but they don't feel like a hard requirement at this point.

          The other important change is much tighter integration of the Namenode and Datanodes. The new design requires that the Namenode and Datanodes be running the same software version. This is a much stricter requirement than having them speaking the same protocol versions. But given that replication and layout can change with software revisions it seems reasonable to enforce. Note that this does not affect HDFS clients, which continue to require protocol compatibility only.

          Konstantin will be publishing an updated document shortly.

          Konstantin Shvachko added a comment -

          I'd like to emphasize the changes in the design of the upgrade and the behavior of the system in general.
          People expressed different opinions during previous discussion so if anybody sees problems with the new
          approach now would be a good time to speak up.

          • No FSSIDs means that there will be no possibility to create multiple snapshots of the fs.
            Only one snapshot at any given time.
            Something like what Doug calls above "filesystem checkpoints" will no longer be possible.
          • The requirement of an exact release version match means there will be no option for administrators
            to stop the name-node (without stopping data-nodes) and restart it with updated software, even if no
            changes to the data layout or data-node protocol have been made.
          • Another important issue in the new design is that data-nodes will decide on their own whether to upgrade
            or discard the old fs state, based on a comparison of the local data layout version and the name-node LV.
            That is, even if you start the name-node in regular mode, some data-nodes which missed previous upgrade(s)
            or discard(s) can decide to do it on their own.

          I wrote a test that creates hard links of block files in a new directory. On my machine a hard link creation
          takes about 10 milliseconds, which is 6,000 blocks per minute.
          Depending on your data-node size you can calculate the cluster startup delay.
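
          For reference, a minimal sketch of such a timing test, written here against the standard java.nio.file API (Files.createLink) rather than the exec-based hard-link helper the patch itself uses; the directory arguments are hypothetical:

            import java.nio.file.DirectoryStream;
            import java.nio.file.Files;
            import java.nio.file.Path;
            import java.nio.file.Paths;

            public class HardLinkTiming {
              public static void main(String[] args) throws Exception {
                Path blockDir = Paths.get(args[0]);     // directory holding existing block files
                Path snapshotDir = Paths.get(args[1]);  // new directory that will hold the links
                Files.createDirectories(snapshotDir);

                long start = System.currentTimeMillis();
                int count = 0;
                try (DirectoryStream<Path> blocks = Files.newDirectoryStream(blockDir)) {
                  for (Path block : blocks) {
                    // A hard link shares the underlying data; no block bytes are copied.
                    Files.createLink(snapshotDir.resolve(block.getFileName()), block);
                    count++;
                  }
                }
                long elapsedMs = System.currentTimeMillis() - start;
                System.out.println(count + " links in " + elapsedMs + " ms");
              }
            }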

          Yoram Arnon added a comment -

          Losing the capability for multiple snapshots is regrettable, but if the snapshots aren't maintained and are just left there to rot then perhaps it's not such a bad thing. A full snapshot solution will need to wait until it's done, well, fully.

          Not being able to restart just the name node with an upgraded version is regrettable too, since we've seen cases where a tiny namenode bug is fixed and it's much simpler to update just one node than to update the entire cluster. Can that limitation be relaxed?

          Datanodes automatically "catching up" on a missed previous upgrade/discard seems like a good thing, isn't it?

          The benchmark - was it executed on a machine with a single disk or several? How fast can links be created (and deleted) on a machine with several disks?

          Raghu Angadi added a comment -

          I vote against requiring a strict build version match between datanodes and the namenode, especially if it results in datanodes being marked dead in case of a mismatch, unless there is an easy way to disable it.

          1) In practice it's hard to make sure every node is running the same software version ALL the time. Pretty soon we might have a case where we forgot, or rsync failed, to push to half the nodes, and the namenode suddenly loses a large chunk of data as a result.

          2) As Konstantin mentioned, even in a test cluster this makes active testing hard. If I am working on a small namenode feature on a not-so-small test cluster, it would require me to push new software and restart the whole cluster many times, also increasing the possibility of (1).

          Yoram Arnon added a comment -

          The last comment leads me to thinking about online upgrades. While this is a ways off, at some point we'll want to upgrade the DFS gradually, without bringing it down, especially for minor changes. I envision upgrading the namenode, which is backwards compatible with the previous version of the datanodes, and having the datanodes upgrade gradually later.
          That would require allowing a version mismatch between namenode and datanodes.

          Konstantin Shvachko added a comment -

          > Data-nodes automatically "catching up" on a missed previous upgrade/discard seems like a good thing, isn't it?

          Yes, but!
          We are trying to protect the system from human mistakes. Now suppose that adminK started the overnight upgrade
          of the system before going home; he plans to come back in the morning and check whether the upgrade was successful
          or not. But another admin, adminY, comes to work earlier and, not knowing about adminK's actions last night, starts the upgrade again.
          The data-nodes will automatically discard the "previous" fs state before upgrading, because they can store only one backup per node.
          So the system can automatically discard the last working state of the file system if the upgraded software had bugs affecting the namespace.
          I see this as the main problem with our new approach.

          Konstantin Shvachko added a comment -

          This is the updated document: FSStateTransition5.htm
          I tried to combine the two design documents into one.

          Nigel Daley added a comment -

          Attached updated test plan for the latest design doc. There are certainly errors in the "expected response" sections of this document, largely due to missing details in the design doc.

          Raghu Angadi added a comment -

          How about enforcing the buildVersion match only when we are rolling back (and maybe also while upgrading and finalizing)?

          Konstantin Shvachko added a comment -

          This is the patch that fully implements the design in the updated document.
          I updated three version numbers - for ClientProtocol, for DatanodeProtocol, and LAYOUT_VERSION,
          which was previously called DFS_CURRENT_VERSION.

          The new code enforces stricter version checking: if a data-node's build version differs from the
          name-node's, the data-node fails, even if the layout and protocol versions are the same.
          The build version is checked during handshake - a new RPC call which happens before registration.
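
          A rough sketch of what the data-node side of such a handshake could look like; the method and field names (versionRequest(), getBuildVersion()) are assumptions for illustration, not necessarily the exact API in the patch:

            // Called once at data-node startup, before register().
            private void handshake(DatanodeProtocol namenode) throws IOException {
              NamespaceInfo nsInfo = namenode.versionRequest();   // assumed new RPC
              String localBuild = Storage.getBuildVersion();      // this node's build version
              if (!localBuild.equals(nsInfo.getBuildVersion())) {
                // Build versions must match exactly, even if layout and protocol versions agree.
                throw new IOException("Incompatible build versions: name-node = "
                    + nsInfo.getBuildVersion() + "; data-node = " + localBuild);
              }
            }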

          The -upgrade feature can be used immediately, although it is not mandatory.
          The expected behavior is that the old fs layout will first be converted into the new layout and then
          saved in the "previous" directory. The "current" directory will contain the new file system state.
          All old files (in "previous") will remain unmodified, and can be restored in case of failure.
          The rollback will not restore the pre-upgrade layout, as pointed out in the design doc.

          After applying the upgrade patch I recommend actually upgrading:

          • start the cluster with the -upgrade option
          • run fsck and some tests
          • run bin/hadoop dfsadmin -finalizeUpgrade

          If something failed during the conversion or later on, I do not recommend using rollback as a recovery procedure.
          In order to recover the pre-upgrade state and layout from the "previous" directory, one should manually rename files, namely:
          for the NameNode
            mv previous/edits ../
            rm previous/VERSION
            mv previous image
            rm current
          for the DataNode
            mv previous/storage ../
            rm previous/VERSION
            mv previous data
            rm current

          Other changes and future work.

          • The name-node image file format has not been changed, and it still contains the layout version and the namespace ID,
            which are redundant now. The reason is that changing the format would make a failure during the conversion unrecoverable:
            if the image is converted but the name-node fails before writing down the version file, the namespace ID and the LV would be lost.
            The image file format should be changed sometime later.
          • I deprecated some methods. Most of them will need to be removed in a subsequent patch.
          • The name-node now locks its storage directory, the same as data-nodes do, so no one can start
            two name-nodes in the same directory from now on (a minimal locking sketch follows this list).
          • I removed unused code in FSEditLog and SecondaryNameNode. This is related to HADOOP-1076 (2)
          • In FSEditLog I replaced 4 arrays by one and eliminated duplicate code.
          • I changed MiniDFSCluster to sleep for 2 seconds before starting each data-node.
            Otherwise many tests were failing, because data-nodes were rolling ports.
            This is not a good fix; we will need to find out why this is happening.
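
          The storage-directory locking mentioned in the list above could look roughly like the sketch below - an advisory lock on a file inside the storage directory; the "in_use.lock" file name and the class name are illustrative assumptions, not the exact code in the patch:

            import java.io.File;
            import java.io.IOException;
            import java.io.RandomAccessFile;
            import java.nio.channels.FileLock;

            public class StorageDirLock {
              private FileLock lock;

              // Try to take an exclusive lock on the storage directory; fail if another
              // name-node or data-node already holds it.
              public void lock(File storageDir) throws IOException {
                File lockFile = new File(storageDir, "in_use.lock");
                RandomAccessFile raf = new RandomAccessFile(lockFile, "rws");
                lock = raf.getChannel().tryLock();
                if (lock == null) {
                  raf.close();
                  throw new IOException("Cannot lock storage " + storageDir
                      + ": the directory is already locked.");
                }
              }

              public void unlock() throws IOException {
                if (lock != null) {
                  lock.release();
                  lock.channel().close();
                  lock = null;
                }
              }
            }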

          Thanks Raghu for reviewing the code and helping with testing.
          Thanks Nigel for testing and for creating a comprehensive junit test that covers at least 134 test cases
          related to the new functionality.

          Nigel Daley added a comment -

          +1 This passes unit tests against trunk revision 520995

          Owen O'Malley added a comment -

          -1

          I strongly dislike adding sleep statements to the test. Please add proper synchronization to remove the need for sleeping.
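
          One way to do that, sketched under the assumption of a hypothetical isDataNodeUp() readiness check on MiniDFSCluster, is to poll the condition with a bounded timeout instead of sleeping for a fixed interval:

            // Wait until the data-node is actually up instead of sleeping blindly.
            static void waitForDataNode(MiniDFSCluster cluster, long timeoutMillis)
                throws IOException, InterruptedException {
              long deadline = System.currentTimeMillis() + timeoutMillis;
              while (!cluster.isDataNodeUp()) {               // hypothetical readiness check
                if (System.currentTimeMillis() > deadline) {
                  throw new IOException("Data-node did not start within " + timeoutMillis + " ms");
                }
                Thread.sleep(100);                            // short poll, not a fixed 2-second delay
              }
            }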

          Konstantin Shvachko added a comment -

          There are at least 2 separate issues related to Owen's comment.
          See HADOOP-1063 and HADOOP-1075.
          I think the comment is based solely on the description I posted.
          I'd prefer to see a more thorough review.

          Raghu Angadi added a comment -

          +1.

          Raghu Angadi added a comment -

          +1 for the patch. I reviewed it and did some basic testing.

          Nigel Daley added a comment -

          I'll rework some of the tests to address Owen's concerns.
          The patch also needs to be updated for the latest trunk.
          I'll create a test-nightly target to run 4 of the new tests which take a while to run. The target will run all tests that start with the word "Nightly".

          Konstantin Shvachko added a comment -

          Multiple patches have been committed that substantially improved junit tests.
          With those patches the upgrade tests now run for about 3 minutes total, so the "Nightly" target is no longer necessary.
          Sleeps in MiniDFSCluster or the upgrade unit tests are no longer required.

          Hadoop QA added a comment -

          +1, because http://issues.apache.org/jira/secure/attachment/12354886/FSStateTransitionApr03.patch applied and successfully tested against trunk revision http://svn.apache.org/repos/asf/lucene/hadoop/trunk/525268 . Results are at http://lucene.zones.apache.org:8080/hudson/job/Hadoop-Patch

          Doug Cutting added a comment -

          I just committed this. Thanks, Konstantin!

          Owen O'Malley added a comment -

          My objections have been addressed and I think this should be committed. There are a couple of things that I'd like cleaned up eventually, but they shouldn't block the patch at this point, in my opinion.
          1. The UpgradeUtilities in test should be merged with mini-dfs cluster.
          2. The static class in FileUtils for HardLink seems unnecessary.
          3. FileUtils.HardLink.createLink handles InterruptedException badly.

          Hadoop QA added a comment -

          Integrated in Hadoop-Nightly #47 (See http://lucene.zones.apache.org:8080/hudson/job/Hadoop-Nightly/47/ )

          Nigel Daley added a comment -

          Attaching a writeup of the manual tests I ran to test the directory structure conversion from pre-0.13 to 0.13.

          Konstantin Shvachko added a comment -

          Attaching FSStateTransition7.htm, which contains new name-node recovery table, and explains how saveNamespace() works.

          SreeHari added a comment -

          What is the expectation in the following scenario:

          The Namenode has 3 name dirs configured.
          1) Namenode upgrade starts - the upgrade fails after the 1st directory is upgraded (the 2nd and 3rd dirs are left unchanged)
          2) Namenode starts
          3) Namenode is shut down and rolled back

          Are the new changes meant to be visible? With the current implementation, after a rollback the new changes are visible. Is this expected?

          As per my observation, since the Namenode saves the latest image dir, the upgraded 1st dir (whose checkpointTime was incremented during the upgrade) will be loaded and saved to all dirs during loadFSImage.

          But if a ROLLBACK is done, the 1st dir will be rolled back (the older copy becomes current and its checkpointTime is now LESS than the other dirs') and the others are left untouched since they don't contain a previous directory. Now during loadFSImage, the 2nd dir will be selected since it has the highest checkpointTime, and it is saved to all dirs (including the 1st). Due to this, the new changes made between UPGRADE and ROLLBACK that are present in the 2nd dir get reflected even after the ROLLBACK.

          Is this expected? I ask this since:

          1) this is not the case with a SUCCESSFUL Upgrade/Rollback (new changes are lost after rollback), and
          2) FSStateTransition7.htm says

          { r0. if previous directory does not exist then fail rollback }

          , but it continues in the current implementation.


            People

            • Assignee:
              Konstantin Shvachko
              Reporter:
              Konstantin Shvachko
            • Votes:
              0
              Watchers:
              3
