Hadoop Common / HADOOP-6774

Namenode is not able to recover from disk full condition

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Duplicate
    • Affects Version/s: 0.20.2, 0.21.0
    • Fix Version/s: None
    • Component/s: fs
    • Labels:
      None
    • Environment:

      Linux sjc9-flash-grid00.ciq.com 2.6.18-164.el5 #1 SMP Thu Sep 3 03:28:30 EDT 2009 x86_64 x86_64 x86_64 GNU/Linux

    • Release Note:
      Implemented a daemon thread that periodically monitors disk usage. If the disk usage reaches the threshold value, the name node is put into safe mode so that no modification to the file system can occur. Once the disk usage drops below the threshold, the name node is taken out of safe mode. The threshold value and the check interval are configurable.

      Description

      We ran an internal flow which resulted in:
      Exception in thread "main" java.lang.RuntimeException: initialization of flow executor failed

      After that we freed disk space on the Namenode server, but restarting Namenode failed.
      Here is the relevant excerpt from the Namenode log:

      2010-05-19 17:15:15,514 INFO org.apache.hadoop.hdfs.server.namenode.NameNode: Namenode up at: sjc1-qa-certiq1.sjc1.ciq.com/10.201.8.247:9000
      2010-05-19 17:15:15,516 INFO org.apache.hadoop.metrics.jvm.JvmMetrics: Initializing JVM Metrics with processName=NameNode, sessionId=null
      2010-05-19 17:15:15,518 INFO org.apache.hadoop.hdfs.server.namenode.metrics.NameNodeMetrics: Initializing NameNodeMeterics using context object:org.apache.hadoop.metrics.spi.NullContext
      2010-05-19 17:15:15,579 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: fsOwner=hadoop,hadoop
      2010-05-19 17:15:15,579 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: supergroup=supergroup
      2010-05-19 17:15:15,579 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: isPermissionEnabled=true
      2010-05-19 17:15:15,588 INFO org.apache.hadoop.hdfs.server.namenode.metrics.FSNamesystemMetrics: Initializing FSNamesystemMetrics using context object:org.apache.hadoop.metrics.spi.NullContext
      2010-05-19 17:15:15,590 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Registered FSNamesystemStatusMBean
      2010-05-19 17:15:15,637 INFO org.apache.hadoop.hdfs.server.common.Storage: Number of files = 1874
      2010-05-19 17:15:16,202 INFO org.apache.hadoop.hdfs.server.common.Storage: Number of files under construction = 2
      2010-05-19 17:15:16,204 INFO org.apache.hadoop.hdfs.server.common.Storage: Image file of size 259450 loaded in 0 seconds.
      2010-05-19 17:15:16,599 ERROR org.apache.hadoop.hdfs.server.namenode.NameNode: java.lang.NumberFormatException: For input string: ""
      at java.lang.NumberFormatException.forInputString(NumberFormatException.java:48)
      at java.lang.Long.parseLong(Long.java:431)
      at java.lang.Long.parseLong(Long.java:468)
      at org.apache.hadoop.hdfs.server.namenode.FSEditLog.readLong(FSEditLog.java:1273)
      at org.apache.hadoop.hdfs.server.namenode.FSEditLog.loadFSEdits(FSEditLog.java:656)
      at org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSEdits(FSImage.java:999)
      at org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSImage(FSImage.java:812)
      at org.apache.hadoop.hdfs.server.namenode.FSImage.recoverTransitionRead(FSImage.java:364)
      at org.apache.hadoop.hdfs.server.namenode.FSDirectory.loadFSImage(FSDirectory.java:88)
      at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.initialize(FSNamesystem.java:312)
      at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.<init>(FSNamesystem.java:293)
      at org.apache.hadoop.hdfs.server.namenode.NameNode.initialize(NameNode.java:224)
      at org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:306)
      at org.apache.hadoop.hdfs.server.namenode.NameNode.createNameNode(NameNode.java:1004)
      at org.apache.hadoop.hdfs.server.namenode.NameNode.main(NameNode.java:1013)

      2010-05-19 17:15:16,599 INFO org.apache.hadoop.hdfs.server.namenode.NameNode: SHUTDOWN_MSG:
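
      For illustration only: the message in the trace above is what Long.parseLong reports when handed an empty string. The sketch below assumes, based on the stack trace alone, that the edit log reader parses a long from a text field. The class EmptyFieldDemo and the method readLongAsText are hypothetical stand-ins, not Hadoop code.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;

public class EmptyFieldDemo {
    // Hypothetical stand-in: reads a length-prefixed string and parses it as a long.
    public static long readLongAsText(DataInputStream in) throws IOException {
        return Long.parseLong(in.readUTF());
    }

    public static void main(String[] args) throws IOException {
        // Write a zero-length string, standing in for a record whose
        // numeric field was never filled in (assumption for this sketch).
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(buf);
        out.writeUTF("");
        out.flush();

        DataInputStream in =
            new DataInputStream(new ByteArrayInputStream(buf.toByteArray()));
        try {
            readLongAsText(in);
        } catch (NumberFormatException e) {
            // The empty field cannot be parsed as a number.
            System.out.println("parse failed: " + e.getMessage());
        }
    }
}
```

      Whether the empty field in this report came from a write cut short by the full disk is not confirmed here; the sketch only shows how an empty field leads to this exception type.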

      1. HADOOP-6774.patch
        103 kB
        Devaraj K
      2. hadoop-6774.stack
        8 kB
        Ted Yu

          Activity

          Ted Yu created issue -
          Ted Yu added a comment -

          Here is the stack trace for job tracker which didn't respond to shutdown request

          Ted Yu made changes -
          Field Original Value New Value
          Attachment hadoop-6774.stack [ 12444977 ]
          Devaraj K added a comment -

          When the disk becomes full, the name node file system (fsimage, edits) gets corrupted and the name node shuts down. When we try to restart it, the name node does not start because the name node file system is corrupted.

          This can be avoided as follows.

          We can implement a daemon that monitors disk usage periodically. If the disk usage reaches the threshold value, it puts the name node into safe mode so that no modification to the file system can occur. Once the disk usage drops below the threshold, the name node is taken out of safe mode.

          Please let me know if anybody has other opinions or suggestions.
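
          The proposal above could be sketched roughly as follows. This is a minimal illustration, not the attached patch: the class DiskUsageMonitor, the SafeModeSwitch interface, and the parameter names are all hypothetical, and a real name node would use its own safe-mode API and configuration keys. Disk usage is taken here to be the used fraction of the volume, which is an assumption about what "threshold" means.

```java
import java.io.File;

public class DiskUsageMonitor implements Runnable {
    /** Hypothetical hook standing in for the name node's safe-mode controls. */
    public interface SafeModeSwitch {
        void enter();
        void leave();
    }

    private final File dir;               // directory on the volume to watch
    private final double usageThreshold;  // used fraction in [0, 1], configurable
    private final long intervalMillis;    // time between checks, configurable
    private final SafeModeSwitch safeMode;
    private boolean inSafeMode = false;

    public DiskUsageMonitor(File dir, double usageThreshold,
                            long intervalMillis, SafeModeSwitch safeMode) {
        this.dir = dir;
        this.usageThreshold = usageThreshold;
        this.intervalMillis = intervalMillis;
        this.safeMode = safeMode;
    }

    /** Used fraction of the volume; 0 if the total size cannot be determined. */
    static double usedFraction(File dir) {
        long total = dir.getTotalSpace();
        if (total <= 0) {
            return 0.0;
        }
        return 1.0 - (double) dir.getUsableSpace() / total;
    }

    /** One check, kept separate from the loop so it can be exercised directly. */
    public void check(double usage) {
        if (!inSafeMode && usage >= usageThreshold) {
            inSafeMode = true;
            safeMode.enter();   // block modifications while the disk is nearly full
        } else if (inSafeMode && usage < usageThreshold) {
            inSafeMode = false;
            safeMode.leave();   // space was freed, allow modifications again
        }
    }

    @Override
    public void run() {
        while (!Thread.currentThread().isInterrupted()) {
            check(usedFraction(dir));
            try {
                Thread.sleep(intervalMillis);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();  // stop the loop on interrupt
            }
        }
    }

    /** Starts the monitor on a daemon thread so it never blocks JVM shutdown. */
    public static Thread startDaemon(DiskUsageMonitor monitor) {
        Thread t = new Thread(monitor, "disk-usage-monitor");
        t.setDaemon(true);
        t.start();
        return t;
    }
}
```

          One consequence of a single threshold is that usage hovering right at the limit would toggle safe mode on every check. Separate enter and leave thresholds would avoid that, at the cost of one more configuration value.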

          Devaraj K added a comment -

          Attached the patch as per the above solution.

          Devaraj K made changes -
          Attachment HADOOP-6774.patch [ 12469144 ]
          Devaraj K made changes -
          Status Open [ 1 ] Patch Available [ 10002 ]
          Release Note Implemented a daemon thread that periodically monitors disk usage. If the disk usage reaches the threshold value, the name node is put into safe mode so that no modification to the file system can occur. Once the disk usage drops below the threshold, the name node is taken out of safe mode. The threshold value and the check interval are configurable.
          Affects Version/s 0.21.0 [ 12313563 ]
          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12469144/HADOOP-6774.patch
          against trunk revision 1062543.

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 3 new or modified tests.

          -1 patch. The patch command could not apply the patch.

          Console output: https://hudson.apache.org/hudson/job/PreCommit-HADOOP-Build/195//console

          This message is automatically generated.

          Todd Lipcon made changes -
          Link This issue relates to HDFS-1566 [ HDFS-1566 ]
          Todd Lipcon added a comment -

          Hi Devaraj. This patch seems to introduce lots of whitespace changes and also doesn't apply to trunk. Could you reformat it without the spurious changes and rebase on trunk?

          Todd Lipcon added a comment -

          I see you opened HDFS-1594. This one should be closed, right, since this is an HDFS patch, not a Common one?

          Devaraj K added a comment -

          Yes, Todd. This issue belongs to HDFS, which is why the patch could not be applied. This can be closed.

          HDFS-1594 can be processed further.

          Konstantin Boudnik added a comment -

          Closed as a dup of HDFS-1594

          Konstantin Boudnik made changes -
          Status Patch Available [ 10002 ] Resolved [ 5 ]
          Resolution Duplicate [ 3 ]
          Transition                  | Time In Source Status | Execution Times | Last Executer      | Last Execution Date
          Open -> Patch Available     | 249d 17h 9m           | 1               | Devaraj K          | 24/Jan/11 12:01
          Patch Available -> Resolved | 22d 7h 40m            | 1               | Konstantin Boudnik | 15/Feb/11 19:42

            People

            • Assignee:
              Unassigned
              Reporter:
              Ted Yu
            • Votes:
              3
              Watchers:
              6
