Hadoop HDFS / HDFS-4423

Checkpoint exception causes fatal damage to fsimage.

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Blocker
    • Resolution: Fixed
    • Affects Version/s: 1.0.4, 1.1.1
    • Fix Version/s: 1.1.2
    • Component/s: namenode
    • Labels:
      None
    • Environment:

      CentOS 6.2

    • Hadoop Flags:
      Reviewed

      Description

      The affected class is org.apache.hadoop.hdfs.server.namenode.FSImage (FSImage.java):

      boolean loadFSImage(MetaRecoveryContext recovery) throws IOException {
        ...
        latestNameSD.read();
        needToSave |= loadFSImage(getImageFile(latestNameSD, NameNodeFile.IMAGE));
        LOG.info("Image file of size " + imageSize + " loaded in "
            + (FSNamesystem.now() - startTime)/1000 + " seconds.");

        // Load latest edits
        if (latestNameCheckpointTime > latestEditsCheckpointTime)
          // the image is already current, discard edits
          needToSave |= true;
        else // latestNameCheckpointTime == latestEditsCheckpointTime
          needToSave |= (loadFSEdits(latestEditsSD, recovery) > 0);

        return needToSave;
      }
      

      In the normal checkpoint flow, latestNameCheckpointTime equals latestEditsCheckpointTime, so the "else" branch executes.
      The problem occurs when latestNameCheckpointTime > latestEditsCheckpointTime:
      The SecondaryNameNode starts a checkpoint,
      ...
      The NameNode executes rollFSImage and shuts down after writing latestNameCheckpointTime but before writing latestEditsCheckpointTime.
      On the next NameNode start, because latestNameCheckpointTime > latestEditsCheckpointTime, needToSave becomes true, but "rootDir"'s nsCount (the cluster's file count) is never updated, since that update only happens in loadFSEdits via "FSNamesystem.getFSNamesystem().dir.updateCountForINodeWithQuota()". "saveNamespace" then writes the file count to the fsimage with its default value of "1".
      The next time, loadFSImage will fail.

      Maybe this will work:

      boolean loadFSImage(MetaRecoveryContext recovery) throws IOException {
        ...
        latestNameSD.read();
        needToSave |= loadFSImage(getImageFile(latestNameSD, NameNodeFile.IMAGE));
        LOG.info("Image file of size " + imageSize + " loaded in "
            + (FSNamesystem.now() - startTime)/1000 + " seconds.");

        // Load latest edits
        if (latestNameCheckpointTime > latestEditsCheckpointTime) {
          // the image is already current, discard edits, but still refresh the
          // namespace counts that loadFSEdits would otherwise have updated
          needToSave |= true;
          FSNamesystem.getFSNamesystem().dir.updateCountForINodeWithQuota();
        }
        else // latestNameCheckpointTime == latestEditsCheckpointTime
          needToSave |= (loadFSEdits(latestEditsSD, recovery) > 0);

        return needToSave;
      }
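
      For context on the comparison above: both latestNameCheckpointTime and latestEditsCheckpointTime are derived from the fstime file kept in each image/edits storage directory. A simplified sketch of that read (assuming the branch-1 layout of <storage dir>/current/fstime; not an exact quote of the FSImage code):

      import java.io.DataInputStream;
      import java.io.File;
      import java.io.FileInputStream;
      import java.io.IOException;

      public class CheckpointTimeReader {
        /**
         * Read the checkpoint time recorded in a storage directory's fstime file.
         * A missing or unreadable file is treated as checkpoint time 0.
         */
        static long readCheckpointTime(File storageDir) throws IOException {
          File timeFile = new File(new File(storageDir, "current"), "fstime");
          if (!timeFile.exists() || !timeFile.canRead()) {
            return 0L;
          }
          DataInputStream in = new DataInputStream(new FileInputStream(timeFile));
          try {
            return in.readLong(); // the checkpoint time is stored as a single long
          } finally {
            in.close();
          }
        }
      }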
      

        Activity

        ChenFolin created issue -
        Chris Nauroth added a comment -

        Thank you for the detailed write-up, ChenFolin. I have one additional question. You mentioned an exception causing NameNode to shutdown during checkpoint after writing latest name checkpoint time, but before writing latest edits checkpoint time. Do you have details on that exception? Was that exception related to this bug, or was it something unrelated that just exposed this problem in the loadFSImage logic?

        Your assessment about the call to FSDirectory#updateCountForINodeWithQuota looks correct. I'm thinking that we should move that call out of FSImage#loadFSEdits and into FSImage#loadFSImage, so that the end of loadFSImage would look like this:

        boolean loadFSImage(MetaRecoveryContext recovery) throws IOException {
          ...
          // Load latest edits
          if (latestNameCheckpointTime > latestEditsCheckpointTime)
            // the image is already current, discard edits
            needToSave |= true;
          else // latestNameCheckpointTime == latestEditsCheckpointTime
            needToSave |= (loadFSEdits(latestEditsSD, recovery) > 0);

          // update the counts.
          FSNamesystem.getFSNamesystem().dir.updateCountForINodeWithQuota();
          return needToSave;
        }
        

        Moving the call there would help guarantee that it always happens.

        Chris Nauroth added a comment -

        I'm uploading a patch that's similar to the original suggestion from ChenFolin.

        When I tried the approach I suggested in my last comment, it didn't work. The reason is that SecondaryNameNode calls directly into FSImage#loadFSEdits and depends on that method calling FSNamesystem.getFSNamesystem().dir.updateCountForINodeWithQuota() as a side effect. It's a less impactful change to add the call in the if block for when the image is already current.

        I've also added a test that simulates the error condition by running a cluster with separate directories for image and edits, forcing the fstime file for edits to contain 0, and then going through a series of restarts/checkpoints to make sure that it can still load the merged image. Before I applied the change in FSImage, this test would fail with EOFException on the last restart, similar to what is described in the bug report. After I applied the fix in FSImage, the test passed.
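
        To reproduce that error condition by hand, here is a minimal sketch (not the test attached to this patch; the path argument is hypothetical) that forces the fstime of an edits-only storage directory to 0, so that on the next restart latestNameCheckpointTime > latestEditsCheckpointTime:

        import java.io.DataOutputStream;
        import java.io.File;
        import java.io.FileOutputStream;

        public class ForceStaleEditsCheckpointTime {
          public static void main(String[] args) throws Exception {
            // args[0]: one of the directories configured in dfs.name.edits.dir
            File fstime = new File(new File(args[0], "current"), "fstime");
            DataOutputStream out = new DataOutputStream(new FileOutputStream(fstime));
            try {
              out.writeLong(0L); // overwrite the stored checkpoint time with 0
            } finally {
              out.close();
            }
          }
        }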

        Chris Nauroth made changes -
        Attachment: HDFS-4423-branch-1.1.patch [ 12567000 ]
        Suresh Srinivas made changes -
        Assignee: Chris Nauroth [ cnauroth ]
        Chris Nauroth added a comment -

        Here is the output from test-patch. Regarding the Findbugs warnings, this is the exact same output I get from using a no-op patch file (a 0-byte file as input to test-patch.sh) applied to branch-1. There are no new warnings related to this patch. Perhaps we need to investigate if a prior patch accidentally introduced new warnings.

        [exec] -1 overall.
        [exec]
        [exec] +1 @author. The patch does not contain any @author tags.
        [exec]
        [exec] +1 tests included. The patch appears to include 4 new or modified tests.
        [exec]
        [exec] +1 javadoc. The javadoc tool did not generate any warning messages.
        [exec]
        [exec] +1 javac. The applied patch does not increase the total number of javac compiler warnings.
        [exec]
        [exec] -1 findbugs. The patch appears to introduce 12 new Findbugs (version 1.3.9) warnings.

        Tsz Wo Nicholas Sze added a comment -

        +1 patch looks good.

        Tsz Wo Nicholas Sze made changes -
        Hadoop Flags: Reviewed [ 10343 ]
        Tsz Wo Nicholas Sze added a comment -

        I have run the tests with the patch. All tests passed except TestNetUtils, but that failure was due to my local network environment and is not related to the patch.

        Tsz Wo Nicholas Sze added a comment -

        I have committed this. Thanks, Chris!

        Tsz Wo Nicholas Sze made changes -
        Status: Open [ 1 ] → Resolved [ 5 ]
        Fix Version/s: 1.1.2 [ 12323593 ]
        Resolution: Fixed [ 1 ]
        Chris Nauroth added a comment -

        Thank you, Nicholas! From my own test run, the only failure I saw was TestJvmReuse. This test has been failing consistently for me on branch-1, even before this patch. I will file a separate jira for follow-up on that.

        Matt Foley added a comment -

        Closed upon successful release of 1.1.2.

        Matt Foley made changes -
        Status: Resolved [ 5 ] → Closed [ 6 ]

  People

    • Assignee: Chris Nauroth
    • Reporter: ChenFolin
    • Votes: 0
    • Watchers: 8


  Time Tracking

    • Original Estimate: 72h
    • Remaining Estimate: 72h
    • Time Spent: Not Specified
