Hadoop HDFS / HDFS-2190

NN fails to start if it encounters an empty or malformed fstime file

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 0.20.203.0
    • Fix Version/s: 0.20.205.0
    • Component/s: namenode
    • Labels:
      None
    • Hadoop Flags:
      Reviewed

      Description

      On startup, the NN reads the fstime file in each of the configured dfs.name.dir directories to determine which one to load. However, if any of the searched directories contains an empty or malformed fstime file, the NN will fail to start. The NN should instead ignore the directory containing the bad fstime file and proceed with startup.
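
      The tolerant read described above might look roughly like the following. This is a minimal sketch, not the code from the attached patches: the class name FsTimeSketch is hypothetical, and it assumes the fstime file holds a single 8-byte long written with DataOutputStream.writeLong, so that "malformed" in practice means "shorter than 8 bytes".

```java
import java.io.DataInputStream;
import java.io.EOFException;
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;

// Hypothetical helper for illustration; not the actual FSImage code.
final class FsTimeSketch {

    /**
     * Returns the checkpoint time stored in the given fstime file.
     * Returns 0 if the file is missing, unreadable, empty, or truncated,
     * so that the caller can skip this directory instead of aborting startup.
     * Assumes the file contains one long written via DataOutputStream.writeLong.
     */
    static long readCheckpointTime(File fstime) throws IOException {
        if (!fstime.exists() || !fstime.canRead()) {
            return 0L;
        }
        try (DataInputStream in = new DataInputStream(new FileInputStream(fstime))) {
            return in.readLong();
        } catch (EOFException e) {
            // Fewer than 8 bytes available: treat like a missing file.
            return 0L;
        }
    }
}
```

      A caller comparing storage directories could then treat a result of 0 as "no usable checkpoint time here" and move on to the next directory.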

      1. hdfs-2190.0.patch
        4 kB
        Aaron T. Myers
      2. hdfs-2190.1.patch
        9 kB
        Aaron T. Myers

        Activity

        Aaron T. Myers created issue -
        Aaron T. Myers added a comment -

        Didn't intend to assign this to myself.

        Aaron T. Myers made changes -
        Field Original Value New Value
        Assignee Aaron T. Myers [ atm ]
        Aaron T. Myers made changes -
        Assignee Aaron T. Myers [ atm ]
        Aaron T. Myers added a comment -

        Patch which addresses the issue. I also took the liberty of removing the vestigial FSImage.getTimeFiles method.

        Aaron T. Myers made changes -
        Attachment hdfs-2190.0.patch [ 12489420 ]
        Todd Lipcon added a comment -

        hmm, how did the dir end up with an empty or malformed one? Any idea? Maybe we should also address that problem (perhaps by backporting AtomicFileOutputStream from trunk?)

        Aaron T. Myers added a comment -

        I can't say for sure what caused the truncation, but with the current code there is a race between the fstime file being created and a value being written to it. If the NN were to crash in between the two, this would leave the file empty.

        Updated patch addressing Todd's comments. This back-ports AtomicFileOutputStream from trunk. This patch also takes the further liberty of cleaning up some bad indentation in FSImage.incrementCheckpointTime.
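
        The general idea behind closing that race can be sketched as follows: write the value to a temporary file, force it to disk, and only then rename it over the real file, so a crash never leaves an empty fstime behind. This is an illustrative sketch using java.nio, not the AtomicFileOutputStream class from trunk; the class name AtomicFsTimeWriteSketch and the ".tmp" suffix are assumptions made for the example.

```java
import java.io.DataOutputStream;
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.StandardCopyOption;

// Hypothetical helper for illustration; not Hadoop's AtomicFileOutputStream.
final class AtomicFsTimeWriteSketch {

    /**
     * Writes the checkpoint time so that readers see either the old
     * complete file or the new complete file, never an empty one.
     */
    static void writeCheckpointTime(File fstime, long checkpointTime) throws IOException {
        // The ".tmp" suffix is an assumption made for this sketch.
        File tmp = new File(fstime.getParentFile(), fstime.getName() + ".tmp");
        try (FileOutputStream fos = new FileOutputStream(tmp);
             DataOutputStream out = new DataOutputStream(fos)) {
            out.writeLong(checkpointTime);
            out.flush();
            // Force the bytes to disk before the rename makes them visible.
            fos.getFD().sync();
        }
        // Atomic rename; whether an existing target is replaced is
        // implementation specific (it is on typical POSIX file systems).
        Files.move(tmp.toPath(), fstime.toPath(), StandardCopyOption.ATOMIC_MOVE);
    }
}
```

        If the process dies before the rename, only the temporary file is affected and the previous fstime, if any, remains intact.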

        Aaron T. Myers made changes -
        Attachment hdfs-2190.1.patch [ 12489426 ]
        Aaron T. Myers added a comment -

        I should've also mentioned, I ran the newly-added test case as well as TestCheckpoint, both of which passed. I can't run test-patch at the moment because of the issues with Apache svn, but will once those are resolved.

        Aaron T. Myers added a comment -

        Results of test-patch:

        +1 overall.  
        
            +1 @author.  The patch does not contain any @author tags.
        
            +1 tests included.  The patch appears to include 2 new or modified tests.
        
            +1 javadoc.  The javadoc tool did not generate any warning messages.
        
            +1 javac.  The applied patch does not increase the total number of javac compiler warnings.
        
            +1 findbugs.  The patch does not introduce any new Findbugs (version 1.3.9) warnings.
        
        Todd Lipcon added a comment -

        hm, the change to fix this possible failure seems good. I'm a little nervous about charging on through the missing files at startup. Have you worked through the various conditions where this might be the case? Is there any time when it would be preferable to fail to start up, and make the user manually choose which storage dir to start from?

        Maybe we should just add a config here that the user can use to acknowledge the corruption and move forward? See HDFS-2079

        Todd Lipcon added a comment -

        Aaron pointed out offline that we already treat non-existent fstime files like this. So, a truncated one should be treated the same as the non-existent one.

        So, +1.

        Aaron T. Myers added a comment -

        Thanks a lot for the reviews, Todd. I just committed this to the security branch.

        Aaron T. Myers made changes -
        Status Open [ 1 ] Resolved [ 5 ]
        Hadoop Flags [Reviewed]
        Fix Version/s 0.20.205.0 [ 12316392 ]
        Resolution Fixed [ 1 ]
        Matt Foley added a comment -

        Closed upon release of 0.20.205.0

        Matt Foley made changes -
        Status Resolved [ 5 ] Closed [ 6 ]
        Transition          Time In Source Status  Execution Times  Last Executer   Last Execution Date
        Open -> Resolved    16d 21h 44m            1                Aaron T. Myers  08/Aug/11 22:55
        Resolved -> Closed  71d 2h 30m             1                Matt Foley      19/Oct/11 01:26

          People

          • Assignee:
            Aaron T. Myers
            Reporter:
            Aaron T. Myers
          • Votes:
            0
            Watchers:
            6
