Hadoop HDFS / HDFS-686

NullPointerException is thrown while merging edit log and image

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Blocker
    • Resolution: Fixed
    • Affects Version/s: 0.20.1, 0.21.0
    • Fix Version/s: None
    • Component/s: namenode
    • Labels: None
    • Hadoop Flags: Reviewed
    • Release Note: Error merging image and edits corrected.

    Description

      Our secondary namenode fails to start with a NullPointerException:
      ERROR org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode: java.lang.NullPointerException
      at org.apache.hadoop.hdfs.server.namenode.FSDirectory.unprotectedSetTimes(FSDirectory.java:1232)
      at org.apache.hadoop.hdfs.server.namenode.FSDirectory.unprotectedSetTimes(FSDirectory.java:1221)
      at org.apache.hadoop.hdfs.server.namenode.FSEditLog.loadFSEdits(FSEditLog.java:776)
      at org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSEdits(FSImage.java:992)
      at org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode$CheckpointStorage.doMerge(SecondaryNameNode.java:590)
      at org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode$CheckpointStorage.access$000(SecondaryNameNode.java:473)
      at org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode.doMerge(SecondaryNameNode.java:350)
      at org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode.doCheckpoint(SecondaryNameNode.java:314)
      at org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode.run(SecondaryNameNode.java:225)
      at java.lang.Thread.run(Thread.java:619)

      This was caused by setting access time on a non-existent file.
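
      For illustration, here is a minimal sketch of the replay-side guard the attached patches add. This is hypothetical, simplified Java: the real method is FSDirectory.unprotectedSetTimes, but the class, signature, and map-based namespace below are stand-ins, not the actual HDFS source.

          // Hypothetical sketch, not the real FSDirectory. Before the fix,
          // replaying an OP_TIMES edit for a file that had already been
          // deleted dereferenced a null inode and brought down the
          // NameNode/SecondaryNameNode with an NPE.
          class EditReplaySketch {
              static final class Inode {
                  long modificationTime;
                  long accessTime;
              }

              private final java.util.Map<String, Inode> namespace =
                  new java.util.HashMap<String, Inode>();

              boolean unprotectedSetTimes(String src, long mtime, long atime) {
                  Inode inode = namespace.get(src); // null if src was deleted
                  if (inode == null) {
                      return false; // skip the stale edit instead of throwing
                  }
                  inode.modificationTime = mtime;
                  inode.accessTime = atime;
                  return true;
              }
          }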

    Attachments

      1. nullSetTime.patch (1 kB, Hairong Kuang)
      2. openNPE-trunk.patch (1 kB, Hairong Kuang)
      3. openNPE.patch (1 kB, Hairong Kuang)


    Activity

          Hairong Kuang added a comment -

          Hi Andrew, it looks like the stack trace that you posted is more like the one described in HDFS-1002. From the image & edits that Ryan posted to HDFS-1002, it seems to me that HDFS may have missed some edit logs. So I am guessing that it was caused by HDFS-955.

          Carlos, thanks a lot for all the information that you provided. Please give me some time to examine it. Since this jira is closed, could you please open a new jira, or we could continue the discussion in HDFS-1002.

          Carlos Valiente added a comment -

          I'm seeing this problem on a 4-node cluster running 0.21.0, revision 916530:

          2010-03-25 19:57:33,538 ERROR org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode: java.lang.NullPointerException
                  at org.apache.hadoop.hdfs.server.namenode.FSDirectory.unprotectedSetTimes(FSDirectory.java:1399)
                  at org.apache.hadoop.hdfs.server.namenode.FSDirectory.unprotectedSetTimes(FSDirectory.java:1387)
                  at org.apache.hadoop.hdfs.server.namenode.FSEditLog.loadEditRecords(FSEditLog.java:672)
                  at org.apache.hadoop.hdfs.server.namenode.FSEditLog.loadFSEdits(FSEditLog.java:401)
                  at org.apache.hadoop.hdfs.server.namenode.FSEditLog.loadFSEdits(FSEditLog.java:368)
                  at org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSEdits(FSImage.java:1172)
                  at org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode$CheckpointStorage.doMerge(SecondaryNameNode.java:594)
                  at org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode$CheckpointStorage.access$000(SecondaryNameNode.java:476)
                  at org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode.doMerge(SecondaryNameNode.java:353)
                  at org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode.doCheckpoint(SecondaryNameNode.java:317)
                  at org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode.run(SecondaryNameNode.java:219)
                  at java.lang.Thread.run(Thread.java:619)
          

          The content of one of my dfs.name.dir is available at http://static.ecmwf.int/hadoop/dfs-2010-03-26-08-44.tar.bz2.

          I've put a consolidated log file of all Hadoop processes at http://static.ecmwf.int/hadoop/hadoop-2010-03-25.log.bz2. The NPE happens at line 206027. When the NPE was thrown, there were several writes taking place.

          Not sure if it matters, but according to my Nagios monitor, one of the data nodes (web211, 136.156.30.211) ran out of disk space at 2010-03-25 17:55:26 (about a minute before the edit log roll started).

          Andrew Purtell added a comment -

          Hairong,

          We are using 0.20.

          We are trying to recover a corrupt volume with about 3 TB of data we'd like to get back. We have followed the steps on the page http://hadoop.apache.org/common/docs/current/hdfs_user_guide.html#Secondary+NameNode but are getting exceptions that look very similar to the ones posted on this issue.

          We tried to use the openNPE patch. The result looks the same:

          2010-02-26 05:47:17,598 ERROR org.apache.hadoop.hdfs.server.common.Storage: java.io.IOException: Unexpected block size: -3664558185340993536
          2010-02-26 05:47:17,600 ERROR org.apache.hadoop.hdfs.server.namenode.NameNode: java.lang.NullPointerException
          at org.apache.hadoop.hdfs.server.namenode.FSDirectory.addChild(FSDirectory.java:1007)
          at org.apache.hadoop.hdfs.server.namenode.FSDirectory.addChild(FSDirectory.java:993)
          at org.apache.hadoop.hdfs.server.namenode.FSDirectory.unprotectedMkdir(FSDirectory.java:967)
          at org.apache.hadoop.hdfs.server.namenode.FSDirectory.unprotectedMkdir(FSDirectory.java:954)
          at org.apache.hadoop.hdfs.server.namenode.FSEditLog.loadFSEdits(FSEditLog.java:696)
          at org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSEdits(FSImage.java:1001)
          at org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSImage(FSImage.java:812)
          at org.apache.hadoop.hdfs.server.namenode.FSImage.recoverTransitionRead(FSImage.java:364)
          at org.apache.hadoop.hdfs.server.namenode.FSDirectory.loadFSImage(FSDirectory.java:88)
          at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.initialize(FSNamesystem.java:311)
          at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.<init>(FSNamesystem.java:292)
          at org.apache.hadoop.hdfs.server.namenode.NameNode.initialize(NameNode.java:201)
          at org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:279)
          at org.apache.hadoop.hdfs.server.namenode.NameNode.createNameNode(NameNode.java:956)
          at org.apache.hadoop.hdfs.server.namenode.NameNode.main(NameNode.java:965)

          It doesn't work.

          Btw, we tried to skip all of these corrupt records by modifying Block.java. However, we then saw this:

          2010-02-26 03:34:58,627 ERROR org.apache.hadoop.hdfs.server.common.Storage: java.lang.IllegalArgumentException: No enum const class org.apache.hadoop.hdfs.protocol.DatanodeInfo$AdminStates.hive_2009_05_31_VSAPI_001_12.VSAPI_001.10637.0.archive.pb$����&;=��ٴ��
          2010-02-26 03:34:58,629 ERROR org.apache.hadoop.hdfs.server.namenode.NameNode: java.lang.NullPointerException
          at org.apache.hadoop.hdfs.server.namenode.FSDirectory.addChild(FSDirectory.java:1007)
          at org.apache.hadoop.hdfs.server.namenode.FSDirectory.addChild(FSDirectory.java:993)
          at org.apache.hadoop.hdfs.server.namenode.FSDirectory.unprotectedMkdir(FSDirectory.java:967)
          at org.apache.hadoop.hdfs.server.namenode.FSDirectory.unprotectedMkdir(FSDirectory.java:954)
          at org.apache.hadoop.hdfs.server.namenode.FSEditLog.loadFSEdits(FSEditLog.java:696)
          at org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSEdits(FSImage.java:1001)
          at org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSImage(FSImage.java:812)
          at org.apache.hadoop.hdfs.server.namenode.FSImage.recoverTransitionRead(FSImage.java:364)
          at org.apache.hadoop.hdfs.server.namenode.FSDirectory.loadFSImage(FSDirectory.java:88)
          at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.initialize(FSNamesystem.java:311)
          at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.<init>(FSNamesystem.java:292)
          at org.apache.hadoop.hdfs.server.namenode.NameNode.initialize(NameNode.java:201)
          at org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:279)
          at org.apache.hadoop.hdfs.server.namenode.NameNode.createNameNode(NameNode.java:956)
          at org.apache.hadoop.hdfs.server.namenode.NameNode.main(NameNode.java:965)

          It looks like some field in the record does not contain a valid value. Do you have any idea?

          Can we provide you the snapshot to take a look at? We would really like to recover this volume.
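
          For context, here is a sketch of the kind of "skip the corrupt record" change described above, and of why it tends to fail. The names are hypothetical, not the real org.apache.hadoop.hdfs.protocol.Block: swallowing one bad field does not repair the underlying corruption, so later reads still decode garbage, which is consistent with the bogus AdminStates value in the second trace.

              import java.io.DataInput;
              import java.io.IOException;

              // Hypothetical sketch only -- not the real Block.java.
              class BlockSketch {
                  long blockId;
                  long numBytes;
                  long generationStamp;

                  void readFields(DataInput in) throws IOException {
                      blockId = in.readLong();
                      numBytes = in.readLong();
                      generationStamp = in.readLong();
                      if (numBytes < 0) {
                          // The crude "skip" hack: swallow the bad size
                          // instead of failing fast. The surrounding bytes
                          // are already corrupt, so the records that follow
                          // still come out as garbage.
                          numBytes = 0;
                      }
                  }
              }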

          Hairong Kuang added a comment -

          The fix is in 0.20, 0.21, and the trunk. Please check if the version that you are running has the fix: openNPE.patch.

          If yes, as I said, the bug may be caused by a different race condition from the one that I fixed. We need more information from the edit log & NameNode log to figure out the cause of the problem; please open a different jira for this.

          dhruba borthakur added a comment -

          Ryan/Andrew: are you both seeing this on hadoop 0.20 or hadoop 0.21?

          Andrew Purtell added a comment -

          We saw this today as well, I think:

          2010-02-25 03:24:11,725 ERROR org.apache.hadoop.hdfs.server.namenode.NameNode: java.lang.NullPointerException
          at org.apache.hadoop.hdfs.server.namenode.FSDirectory.addChild(FSDirectory.java:1006)
          at org.apache.hadoop.hdfs.server.namenode.FSDirectory.addChild(FSDirectory.java:992)
          at org.apache.hadoop.hdfs.server.namenode.FSDirectory.unprotectedMkdir(FSDirectory.java:966)
          at org.apache.hadoop.hdfs.server.namenode.FSDirectory.unprotectedMkdir(FSDirectory.java:953)
          at org.apache.hadoop.hdfs.server.namenode.FSEditLog.loadFSEdits(FSEditLog.java:696)
          at org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSEdits(FSImage.java:1001)
          at org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSImage(FSImage.java:812)
          at org.apache.hadoop.hdfs.server.namenode.FSImage.recoverTransitionRead(FSImage.java:364)
          at org.apache.hadoop.hdfs.server.namenode.FSDirectory.loadFSImage(FSDirectory.java:87)
          at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.initialize(FSNamesystem.java:311)
          at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.<init>(FSNamesystem.java:292)
          at org.apache.hadoop.hdfs.server.namenode.NameNode.initialize(NameNode.java:201)
          at org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:279)
          at org.apache.hadoop.hdfs.server.namenode.NameNode.createNameNode(NameNode.java:956)
          at org.apache.hadoop.hdfs.server.namenode.NameNode.main(NameNode.java:965)

          Trying to recover a corrupted FS image from snapshot. Currently a no-go.

          ryan rawson added a comment -

          this just happened to me again:

          2010-02-24 05:02:48,054 ERROR org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode: java.lang.NullPointerException
          at org.apache.hadoop.hdfs.server.namenode.FSDirectory.addChild(FSDirectory.java:1152)
          at org.apache.hadoop.hdfs.server.namenode.FSDirectory.addChild(FSDirectory.java:1164)
          at org.apache.hadoop.hdfs.server.namenode.FSDirectory.addNode(FSDirectory.java:1067)
          at org.apache.hadoop.hdfs.server.namenode.FSDirectory.unprotectedAddFile(FSDirectory.java:213)
          at org.apache.hadoop.hdfs.server.namenode.FSEditLog.loadEditRecords(FSEditLog.java:511)
          at org.apache.hadoop.hdfs.server.namenode.FSEditLog.loadFSEdits(FSEditLog.java:401)
          at org.apache.hadoop.hdfs.server.namenode.FSEditLog.loadFSEdits(FSEditLog.java:368)
          at org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSEdits(FSImage.java:1172)
          at org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode$CheckpointStorage.doMerge(SecondaryNameNode.java:594)
          at org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode$CheckpointStorage.access$000(SecondaryNameNode.java:476)
          at org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode.doMerge(SecondaryNameNode.java:353)
          at org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode.doCheckpoint(SecondaryNameNode.java:317)
          at org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode.run(SecondaryNameNode.java:219)
          at java.lang.Thread.run(Thread.java:619)

          I can provide the image and edits.

          ryan rawson added a comment -

          Why are you closing this as "fixed" when it clearly isn't?

          Hairong Kuang added a comment -

          nullSetTime.patch is not supposed to be committed. It is meant for users who came across the problem, to get their NN or SNN up.

          Ryan, if you continue to see this bug in 0.21, please file another jira and provide more details on your edit log & NN logs. There might be another race condition in the code.

          Let me close this jira for now.

          dhruba borthakur added a comment -

          I see that this patch was already committed to 0.21. Isn't that right, Todd?

          Todd Lipcon added a comment -

          Reopening to address Ryan's question above.

          ryan rawson added a comment -

          here is my stack trace:
          2010-02-13 21:46:54,748 ERROR org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode: Throwable Exception in doCheckpoint:
          2010-02-13 21:46:54,749 ERROR org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode: java.lang.NullPointerException
          at org.apache.hadoop.hdfs.server.namenode.FSDirectory.unprotectedSetTimes(FSDirectory.java:1399)
          at org.apache.hadoop.hdfs.server.namenode.FSDirectory.unprotectedSetTimes(FSDirectory.java:1387)
          at org.apache.hadoop.hdfs.server.namenode.FSEditLog.loadEditRecords(FSEditLog.java:672)
          at org.apache.hadoop.hdfs.server.namenode.FSEditLog.loadFSEdits(FSEditLog.java:401)
          at org.apache.hadoop.hdfs.server.namenode.FSEditLog.loadFSEdits(FSEditLog.java:368)
          at org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSEdits(FSImage.java:1172)
          at org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode$CheckpointStorage.doMerge(SecondaryNameNode.java:594)
          at org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode$CheckpointStorage.access$000(SecondaryNameNode.java:476)
          at org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode.doMerge(SecondaryNameNode.java:353)
          at org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode.doCheckpoint(SecondaryNameNode.java:317)
          at org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode.run(SecondaryNameNode.java:219)
          at java.lang.Thread.run(Thread.java:619)

          ryan rawson added a comment -

          Hey guys, I ran into this on branch-0.21, probably because the patch contained in the file "nullSetTime.patch" was never committed!

          Why would this issue be closed?!

          Hudson added a comment -

          Integrated in Hdfs-Patch-h5.grid.sp2.yahoo.net #78 (See http://hudson.zones.apache.org/hudson/job/Hdfs-Patch-h5.grid.sp2.yahoo.net/78/)

          Hudson added a comment -

          Integrated in Hadoop-Hdfs-trunk-Commit #79 (See http://hudson.zones.apache.org/hudson/job/Hadoop-Hdfs-trunk-Commit/79/)

          Hudson added a comment -

          Integrated in Hdfs-Patch-h2.grid.sp2.yahoo.net #47 (See http://hudson.zones.apache.org/hudson/job/Hdfs-Patch-h2.grid.sp2.yahoo.net/47/)

          dhruba borthakur added a comment -

          I agree, it is difficult to write a unit test for this one. And the change is very simple.

          Hairong Kuang added a comment -

          If you have already experienced this problem when starting either the namenode or the secondary namenode, you can apply the attached patch to get your node up again. This is a patch for 0.20.

          Hairong Kuang added a comment -

          I already explained why I committed the patch without a test:
          > No test is included because it is very hard to write tests for race conditions and the fix is simple too.

          Konstantin Boudnik added a comment -

          Hairong, for accounting purposes I'd like to have a comment on why no tests were included. It might be obvious right now, but later on it will require extra effort to find out why a -1'ed patch was committed.

          Hairong Kuang added a comment -

          All unit tests passed.

          I've committed this!

          Hairong Kuang added a comment -

          Looks like Hudson has not picked up the submitted patch yet. I ran ant test-patch on my local machine:
          [exec] +1 overall.
          [exec]
          [exec] +1 @author. The patch does not contain any @author tags.
          [exec]
          [exec] -1 tests included. The patch doesn't appear to include any new or modified tests.
          [exec] Please justify why no new tests are needed for this patch.
          [exec] Also please list what manual steps were performed to verify this patch.
          [exec]
          [exec] +1 javadoc. The javadoc tool did not generate any warning messages.
          [exec]
          [exec] +1 javac. The applied patch does not increase the total number of javac compiler warnings.
          [exec]
          [exec] +1 findbugs. The patch does not introduce any new Findbugs warnings.
          [exec]
          [exec] +1 release audit. The applied patch does not increase the total number of release audit warnings.

          No test is included because it is very hard to write tests for race conditions and the fix is simple too.

          Hairong Kuang added a comment -

          Thanks Dhruba!

          Here is a patch for the trunk.

          dhruba borthakur added a comment -

          Good catch. +1

          Hairong Kuang added a comment -

          A patch for 0.20.

          Hairong Kuang added a comment -

          The edit log showed that the file that caused the problem was moved to the trash right before it was opened for read, which reset its access time. But how could the open succeed? It turned out to be a synchronization problem: the NameNode fetches the inode of the file to be opened without holding the fsnamesystem lock.

          Approximately, this is what happened. The move-to-trash and open requests arrived at the NN simultaneously, so the NN:
          1. fetched the inode for the file;
          2. moved the file to trash;
          3. opened the file and set its access time.

          Although step 2 removed the file's inode, step 3 succeeded because it accessed the file through the inode fetched in step 1.
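
          For illustration, here is a simplified sketch of that race and the shape of the fix. All names are hypothetical stand-ins, not the real FSNamesystem/FSDirectory code:

              import java.io.FileNotFoundException;
              import java.util.Map;
              import java.util.concurrent.ConcurrentHashMap;

              // Hypothetical sketch of the open-vs-delete race.
              class NamespaceSketch {
                  static final class Inode { volatile long accessTime; }

                  private final Object fsLock = new Object(); // stands in for the fsnamesystem lock
                  private final Map<String, Inode> files = new ConcurrentHashMap<String, Inode>();

                  // Buggy pattern: the inode is resolved before the lock is
                  // taken, so a concurrent delete can unlink it in between.
                  // Setting the access time on the stale reference still
                  // "succeeds", and an OP_TIMES edit gets logged for a path
                  // that no longer exists -- the NPE then surfaces at replay.
                  void openBuggy(String src, long now) {
                      Inode inode = files.get(src); // fetched without the lock
                      synchronized (fsLock) {
                          inode.accessTime = now;   // works even after delete(src)
                      }
                  }

                  // Fixed pattern: resolve and validate the inode while
                  // holding the lock, so a deleted file is rejected before
                  // any edit is logged.
                  void openFixed(String src, long now) throws FileNotFoundException {
                      synchronized (fsLock) {
                          Inode inode = files.get(src);
                          if (inode == null) {
                              throw new FileNotFoundException(src);
                          }
                          inode.accessTime = now;
                      }
                  }

                  void delete(String src) {
                      synchronized (fsLock) {
                          files.remove(src);
                      }
                  }
              }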


    People

    • Assignee: Hairong Kuang
    • Reporter: Hairong Kuang
    • Votes: 0
    • Watchers: 11
