Hadoop HDFS
  1. Hadoop HDFS
  2. HDFS-3771

Namenode can't restart due to corrupt edit logs, timing issue with shutdown and edit log rolling

    Details

    • Type: Bug Bug
    • Status: Resolved
    • Priority: Critical Critical
    • Resolution: Won't Fix
    • Affects Version/s: 0.23.3
    • Fix Version/s: None
    • Component/s: namenode
    • Labels:
      None
    • Environment:

      QE, 20 node Federated cluster with 3 NNs and 15 DNs, using Kerberos based security

      Description

      Our 0.23.3 nightly HDFS regression suite encountered a particularly nasty issue recently, which resulted in the cluster's default Namenode being unable to restart, this was on a 20 node Federated cluster with security. The cause appears to be that the NN was just starting to roll its edit log when a shutdown occurred, the shutdown was intentional to restart the cluster as part of an automated test.

      The tests that were running do not appear to be the issue in themselves, the cluster was just wrapping up an adminReport subset and this failure case has not reproduce so far, nor was it failing previously. It looks like a chance occurrence of sending the shutdown just as the edit log roll was begun.

      From the NN log, the following sequence is noted:

      1. an InvalidateBlocks operation had completed
      2. FSNamesystem: Roll Edit Log from [Secondary Namenode IPaddr]
      3. FSEditLog: Ending log segment 23963
      4. FSEditLog: Starting log segment at 23967
      4. NameNode: SHUTDOWN_MSG
      => the NN shuts down and then is restarted...
      5. FSImageTransactionalStorageInspector: Logs beginning at txid 23967 were are all in-progress
      6. FSImageTransactionalStorageInspector: Marking log at /grid/[PATH]/edits_inprogress_0000000000000023967 as corrupt since it has no transactions in it.
      7. NameNode: Exception in namenode join [main]java.lang.IllegalStateException: No non-corrupt logs for txid 23967
      => NN start attempts continue to cycle trying to restart but can't, failing on the same exception due to lack of non-corrupt edit logs

      If observations are correct and issue is from shutdown happening as edit logs are rolling, does the NN have an equivalent to the conventional fs 'sync' blocking action that should be called, or perhaps has a timing hole?

        Issue Links

          Activity

          patrick white created issue -
          Hide
          Todd Lipcon added a comment -

          The following is interesting:

          3. FSEditLog: Ending log segment 23963
          4. FSEditLog: Starting log segment at 23967

          That's not a typo? i.e there's a gap between the end of the previous segment and the start of the next? Perhaps it's just an unrelated logging error, though.

          Show
          Todd Lipcon added a comment - The following is interesting: 3. FSEditLog: Ending log segment 23963 4. FSEditLog: Starting log segment at 23967 That's not a typo? i.e there's a gap between the end of the previous segment and the start of the next? Perhaps it's just an unrelated logging error, though.
          Suresh Srinivas made changes -
          Field Original Value New Value
          Affects Version/s 2.0.0-alpha [ 12320353 ]
          Hide
          patrick white added a comment -

          Hi Todd, right, the TXIDs are correct (unlike my line item numbering in the Description) here's the verbatim log snippet with classpath removed for clarity, hostnames and ipaddr generalized;

          [org.apache.hadoop.hdfs.server.blockmanagement.BlockManager$ReplicationMonitor@3ecec78d]2012-08-06 13:02:52,216 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Roll Edit Log from [Secondary NN]
          [IPC Server handler 70 on 8020]2012-08-06 13:02:52,217 INFO org.apache.hadoop.hdfs.server.namenode.FSEditLog: Rolling edit logs.
          [IPC Server handler 70 on 8020]2012-08-06 13:02:52,217 INFO org.apache.hadoop.hdfs.server.namenode.FSEditLog: Ending log segment 23963
          [IPC Server handler 70 on 8020]2012-08-06 13:02:52,218 INFO org.apache.hadoop.hdfs.server.namenode.FSEditLog: Number of transactions: 4 Total time for transactions(ms): 1Number of transactions batched in Syncs: 0 Number of syncs: 5 SyncTimes(ms): 23
          [IPC Server handler 70 on 8020]2012-08-06 13:02:52,220 INFO org.apache.hadoop.hdfs.server.namenode.FSEditLog: Starting log segment at 23967
          [IPC Server handler 70 on 8020]2012-08-06 13:02:52,234 INFO org.apache.hadoop.hdfs.server.namenode.NameNode: SHUTDOWN_MSG:
          /************************************************************
          SHUTDOWN_MSG: Shutting down NameNode at [HOSTNAME/IPADDR]
          ************************************************************/
          [Thread-1]2012-08-06 13:03:39,287 INFO org.apache.hadoop.hdfs.server.namenode.NameNode: STARTUP_MSG:
          /************************************************************
          STARTUP_MSG: Starting NameNode
          STARTUP_MSG: host = [HOSTNAME/IPADDR]
          STARTUP_MSG: args = []
          STARTUP_MSG: version = 0.23.3.1208042202

          STARTUP_MSG: classpath = << CLASSPATH REMOVED >>

          STARTUP_MSG: build = http://svn.apache.org/repos/asf/hadoop/common/branches/branch-0.23/hadoop-common-project/hadoop-common -r 1368004; compiled by '[QEUSER]' on Sat Aug 4 22:15:58 PDT 2012
          ************************************************************/
          [main]2012-08-06 13:03:39,506 INFO org.apache.hadoop.metrics2.impl.MetricsConfig: loaded properties from hadoop-metrics2.properties
          [main]2012-08-06 13:03:39,725 INFO org.apache.hadoop.metrics2.impl.MetricsSinkAdapter: Sink simon_jvm started
          [main]2012-08-06 13:03:39,753 INFO org.apache.hadoop.metrics2.impl.MetricsSinkAdapter: Sink simon_rpc started
          [main]2012-08-06 13:03:39,800 INFO org.apache.hadoop.metrics2.impl.MetricsSinkAdapter: Sink simon_dfs started
          [main]2012-08-06 13:03:39,861 INFO org.apache.hadoop.metrics2.impl.MetricsSystemImpl: Scheduled snapshot period at 10 second(s).
          [main]2012-08-06 13:03:39,861 INFO org.apache.hadoop.metrics2.impl.MetricsSystemImpl: NameNode metrics system started
          [main]2012-08-06 13:03:40,342 INFO org.apache.hadoop.security.UserGroupInformation: Login successful for user hdfs/HOSTNAME@DOMAIN using keytab file [KEYTAB]
          [main]2012-08-06 13:03:40,464 INFO org.apache.hadoop.util.HostsFileReader: Refreshing hosts (include/exclude) list
          [main]2012-08-06 13:03:40,469 INFO org.apache.hadoop.hdfs.server.blockmanagement.DatanodeManager: dfs.block.invalidate.limit=1000
          [main]2012-08-06 13:03:40,472 INFO org.apache.hadoop.hdfs.util.GSet: VM type = 64-bit
          [main]2012-08-06 13:03:40,472 INFO org.apache.hadoop.hdfs.util.GSet: 2% max memory = 273.85625 MB
          [main]2012-08-06 13:03:40,472 INFO org.apache.hadoop.hdfs.util.GSet: capacity = 2^25 = 33554432 entries
          [main]2012-08-06 13:03:40,472 INFO org.apache.hadoop.hdfs.util.GSet: recommended=33554432, actual=33554432
          [main]2012-08-06 13:03:40,674 INFO org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: dfs.block.access.token.enable=true
          [main]2012-08-06 13:03:40,674 INFO org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: dfs.block.access.key.update.interval=600 min(s), dfs.block.access.token.lifetime=600 min(s)
          [main]2012-08-06 13:03:40,683 INFO org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: defaultReplication = 3
          [main]2012-08-06 13:03:40,683 INFO org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: maxReplication = 50
          [main]2012-08-06 13:03:40,684 INFO org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: minReplication = 1
          [main]2012-08-06 13:03:40,684 INFO org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: maxReplicationStreams = 2
          [main]2012-08-06 13:03:40,684 INFO org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: shouldCheckForEnoughRacks = true
          [main]2012-08-06 13:03:40,684 INFO org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: replicationRecheckInterval = 3000
          [main]2012-08-06 13:03:40,684 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: fsOwner=hdfs/[HOSTNAME@DOMAIN] (auth:KERBEROS)
          [main]2012-08-06 13:03:40,684 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: supergroup=hdfs
          [main]2012-08-06 13:03:40,684 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: isPermissionEnabled=true
          [main]2012-08-06 13:03:40,692 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Registered FSNamesystemState MBean
          [main]2012-08-06 13:03:40,732 INFO org.apache.hadoop.hdfs.server.namenode.NameNode: Caching file names occuring more than 10 times
          [main]2012-08-06 13:03:40,810 WARN org.apache.hadoop.hdfs.server.namenode.FSImageTransactionalStorageInspector: Logs beginning at txid were are all in-progress (probably truncated due to a previous NameNode crash)
          [main]2012-08-06 13:03:40,817 WARN org.apache.hadoop.hdfs.server.namenode.FSImageTransactionalStorageInspector: Marking log at /grid/[PATH]/edits_inprogress_0000000000000023967 as corrupt since it has no tra nsactions in it.
          [main]2012-08-06 13:03:40,818 ERROR org.apache.hadoop.hdfs.server.namenode.NameNode: Exception in namenode join
          [main]java.lang.IllegalStateException: No non-corrupt logs for txid 23967
          at org.apache.hadoop.hdfs.server.namenode.FSImageTransactionalStorageInspector$LogGroup.getBestNonCorruptLog(FSImageTransactionalStorageInspector.java:302)
          at org.apache.hadoop.hdfs.server.namenode.FSImageTransactionalStorageInspector.createLogLoadPlan(FSImageTransactionalStorageInspector.java:230)
          at org.apache.hadoop.hdfs.server.namenode.FSImageTransactionalStorageInspector.createLoadPlan(FSImageTransactionalStorageInspector.java:178)
          at org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSImage(FSImage.java:592)
          at org.apache.hadoop.hdfs.server.namenode.FSImage.recoverTransitionRead(FSImage.java:276)
          at org.apache.hadoop.hdfs.server.namenode.FSDirectory.loadFSImage(FSDirectory.java:175)
          at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.initialize(FSNamesystem.java:333)
          at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.<init>(FSNamesystem.java:305)
          at org.apache.hadoop.hdfs.server.namenode.NameNode.loadNamesystem(NameNode.java:299)
          at org.apache.hadoop.hdfs.server.namenode.NameNode.initialize(NameNode.java:333)
          at org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:459)
          at org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:451)
          at org.apache.hadoop.hdfs.server.namenode.NameNode.createNameNode(NameNode.java:752)
          at org.apache.hadoop.hdfs.server.namenode.NameNode.main(NameNode.java:800)
          2012-08-06 13:03:40,820 INFO org.apache.hadoop.hdfs.server.namenode.NameNode: SHUTDOWN_MSG:
          /************************************************************
          SHUTDOWN_MSG: Shutting down NameNode at [HOSTNAME/IPADDR]
          ************************************************************/

          Show
          patrick white added a comment - Hi Todd, right, the TXIDs are correct (unlike my line item numbering in the Description) here's the verbatim log snippet with classpath removed for clarity, hostnames and ipaddr generalized; [org.apache.hadoop.hdfs.server.blockmanagement.BlockManager$ReplicationMonitor@3ecec78d] 2012-08-06 13:02:52,216 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Roll Edit Log from [Secondary NN] [IPC Server handler 70 on 8020] 2012-08-06 13:02:52,217 INFO org.apache.hadoop.hdfs.server.namenode.FSEditLog: Rolling edit logs. [IPC Server handler 70 on 8020] 2012-08-06 13:02:52,217 INFO org.apache.hadoop.hdfs.server.namenode.FSEditLog: Ending log segment 23963 [IPC Server handler 70 on 8020] 2012-08-06 13:02:52,218 INFO org.apache.hadoop.hdfs.server.namenode.FSEditLog: Number of transactions: 4 Total time for transactions(ms): 1Number of transactions batched in Syncs: 0 Number of syncs: 5 SyncTimes(ms): 23 [IPC Server handler 70 on 8020] 2012-08-06 13:02:52,220 INFO org.apache.hadoop.hdfs.server.namenode.FSEditLog: Starting log segment at 23967 [IPC Server handler 70 on 8020] 2012-08-06 13:02:52,234 INFO org.apache.hadoop.hdfs.server.namenode.NameNode: SHUTDOWN_MSG: /************************************************************ SHUTDOWN_MSG: Shutting down NameNode at [HOSTNAME/IPADDR] ************************************************************/ [Thread-1] 2012-08-06 13:03:39,287 INFO org.apache.hadoop.hdfs.server.namenode.NameNode: STARTUP_MSG: /************************************************************ STARTUP_MSG: Starting NameNode STARTUP_MSG: host = [HOSTNAME/IPADDR] STARTUP_MSG: args = [] STARTUP_MSG: version = 0.23.3.1208042202 STARTUP_MSG: classpath = << CLASSPATH REMOVED >> STARTUP_MSG: build = http://svn.apache.org/repos/asf/hadoop/common/branches/branch-0.23/hadoop-common-project/hadoop-common -r 1368004; compiled by ' [QEUSER] ' on Sat Aug 4 22:15:58 PDT 2012 ************************************************************/ [main] 2012-08-06 13:03:39,506 INFO org.apache.hadoop.metrics2.impl.MetricsConfig: loaded properties from hadoop-metrics2.properties [main] 2012-08-06 13:03:39,725 INFO org.apache.hadoop.metrics2.impl.MetricsSinkAdapter: Sink simon_jvm started [main] 2012-08-06 13:03:39,753 INFO org.apache.hadoop.metrics2.impl.MetricsSinkAdapter: Sink simon_rpc started [main] 2012-08-06 13:03:39,800 INFO org.apache.hadoop.metrics2.impl.MetricsSinkAdapter: Sink simon_dfs started [main] 2012-08-06 13:03:39,861 INFO org.apache.hadoop.metrics2.impl.MetricsSystemImpl: Scheduled snapshot period at 10 second(s). [main] 2012-08-06 13:03:39,861 INFO org.apache.hadoop.metrics2.impl.MetricsSystemImpl: NameNode metrics system started [main] 2012-08-06 13:03:40,342 INFO org.apache.hadoop.security.UserGroupInformation: Login successful for user hdfs/HOSTNAME@DOMAIN using keytab file [KEYTAB] [main] 2012-08-06 13:03:40,464 INFO org.apache.hadoop.util.HostsFileReader: Refreshing hosts (include/exclude) list [main] 2012-08-06 13:03:40,469 INFO org.apache.hadoop.hdfs.server.blockmanagement.DatanodeManager: dfs.block.invalidate.limit=1000 [main] 2012-08-06 13:03:40,472 INFO org.apache.hadoop.hdfs.util.GSet: VM type = 64-bit [main] 2012-08-06 13:03:40,472 INFO org.apache.hadoop.hdfs.util.GSet: 2% max memory = 273.85625 MB [main] 2012-08-06 13:03:40,472 INFO org.apache.hadoop.hdfs.util.GSet: capacity = 2^25 = 33554432 entries [main] 2012-08-06 13:03:40,472 INFO org.apache.hadoop.hdfs.util.GSet: recommended=33554432, actual=33554432 [main] 2012-08-06 13:03:40,674 INFO org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: dfs.block.access.token.enable=true [main] 2012-08-06 13:03:40,674 INFO org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: dfs.block.access.key.update.interval=600 min(s), dfs.block.access.token.lifetime=600 min(s) [main] 2012-08-06 13:03:40,683 INFO org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: defaultReplication = 3 [main] 2012-08-06 13:03:40,683 INFO org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: maxReplication = 50 [main] 2012-08-06 13:03:40,684 INFO org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: minReplication = 1 [main] 2012-08-06 13:03:40,684 INFO org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: maxReplicationStreams = 2 [main] 2012-08-06 13:03:40,684 INFO org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: shouldCheckForEnoughRacks = true [main] 2012-08-06 13:03:40,684 INFO org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: replicationRecheckInterval = 3000 [main] 2012-08-06 13:03:40,684 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: fsOwner=hdfs/ [HOSTNAME@DOMAIN] (auth:KERBEROS) [main] 2012-08-06 13:03:40,684 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: supergroup=hdfs [main] 2012-08-06 13:03:40,684 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: isPermissionEnabled=true [main] 2012-08-06 13:03:40,692 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Registered FSNamesystemState MBean [main] 2012-08-06 13:03:40,732 INFO org.apache.hadoop.hdfs.server.namenode.NameNode: Caching file names occuring more than 10 times [main] 2012-08-06 13:03:40,810 WARN org.apache.hadoop.hdfs.server.namenode.FSImageTransactionalStorageInspector: Logs beginning at txid were are all in-progress (probably truncated due to a previous NameNode crash) [main] 2012-08-06 13:03:40,817 WARN org.apache.hadoop.hdfs.server.namenode.FSImageTransactionalStorageInspector: Marking log at /grid/ [PATH] /edits_inprogress_0000000000000023967 as corrupt since it has no tra nsactions in it. [main] 2012-08-06 13:03:40,818 ERROR org.apache.hadoop.hdfs.server.namenode.NameNode: Exception in namenode join [main] java.lang.IllegalStateException: No non-corrupt logs for txid 23967 at org.apache.hadoop.hdfs.server.namenode.FSImageTransactionalStorageInspector$LogGroup.getBestNonCorruptLog(FSImageTransactionalStorageInspector.java:302) at org.apache.hadoop.hdfs.server.namenode.FSImageTransactionalStorageInspector.createLogLoadPlan(FSImageTransactionalStorageInspector.java:230) at org.apache.hadoop.hdfs.server.namenode.FSImageTransactionalStorageInspector.createLoadPlan(FSImageTransactionalStorageInspector.java:178) at org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSImage(FSImage.java:592) at org.apache.hadoop.hdfs.server.namenode.FSImage.recoverTransitionRead(FSImage.java:276) at org.apache.hadoop.hdfs.server.namenode.FSDirectory.loadFSImage(FSDirectory.java:175) at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.initialize(FSNamesystem.java:333) at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.<init>(FSNamesystem.java:305) at org.apache.hadoop.hdfs.server.namenode.NameNode.loadNamesystem(NameNode.java:299) at org.apache.hadoop.hdfs.server.namenode.NameNode.initialize(NameNode.java:333) at org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:459) at org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:451) at org.apache.hadoop.hdfs.server.namenode.NameNode.createNameNode(NameNode.java:752) at org.apache.hadoop.hdfs.server.namenode.NameNode.main(NameNode.java:800) 2012-08-06 13:03:40,820 INFO org.apache.hadoop.hdfs.server.namenode.NameNode: SHUTDOWN_MSG: /************************************************************ SHUTDOWN_MSG: Shutting down NameNode at [HOSTNAME/IPADDR] ************************************************************/
          Hide
          Todd Lipcon added a comment -

          Ah, OK, I think the log message thing I mentioned above was a red herring. The previous segment started at 23963, and that's what it was logging. Not a problem.

          Can you upload /grid/[PATH]/edits_inprogress_0000000000000023967 which may now be renamed with a ".corrupt" suffix of some kind? I want to make sure it is in fact empty and not some kind of strange corruption.

          Show
          Todd Lipcon added a comment - Ah, OK, I think the log message thing I mentioned above was a red herring. The previous segment started at 23963, and that's what it was logging. Not a problem. Can you upload /grid/ [PATH] /edits_inprogress_0000000000000023967 which may now be renamed with a ".corrupt" suffix of some kind? I want to make sure it is in fact empty and not some kind of strange corruption.
          Hide
          patrick white added a comment -

          Sorry, wished i could, i got the nn log but didn't get the edit log. i initially thought this was coming from a namespace corruption bug and redeployed trying to reproduce the issue before narrowing it down to the edit logs. If i can reproduce this, i'll be sure to grab the edits.

          Show
          patrick white added a comment - Sorry, wished i could, i got the nn log but didn't get the edit log. i initially thought this was coming from a namespace corruption bug and redeployed trying to reproduce the issue before narrowing it down to the edit logs. If i can reproduce this, i'll be sure to grab the edits.
          Hide
          patrick white added a comment -

          Hi Todd, do you mind if i ask your opinion on the potential impact of this issue, and possible mitigation for it? So far we've not been able to reproduce the issue, either in the nightly regression testing or in specific tests. That, combined with regression history, would tend to indicate this is very unlikely to occur operationally, however if it would occur the affects could be severe. The NN restart should not be difficult once the cause is identified as corrupt edits, it should come up once the corrupt logs are cleared. Is that true or am i missing an event sequence tracking that would need to be corrected?

          The larger concern should be for data loss. Based on what happened in this case it appears that any pending txids would be lost, unless the edit logs could be manually repaired. The filesystem would be intact, only minus the changes from the outstanding edit events, does that sound correct?

          Show
          patrick white added a comment - Hi Todd, do you mind if i ask your opinion on the potential impact of this issue, and possible mitigation for it? So far we've not been able to reproduce the issue, either in the nightly regression testing or in specific tests. That, combined with regression history, would tend to indicate this is very unlikely to occur operationally, however if it would occur the affects could be severe. The NN restart should not be difficult once the cause is identified as corrupt edits, it should come up once the corrupt logs are cleared. Is that true or am i missing an event sequence tracking that would need to be corrected? The larger concern should be for data loss. Based on what happened in this case it appears that any pending txids would be lost, unless the edit logs could be manually repaired. The filesystem would be intact, only minus the changes from the outstanding edit events, does that sound correct?
          Hide
          Todd Lipcon added a comment -

          Hey Patrick. I think this behavior might have been fixed in 2.0.0 already – the empty file should get properly ignored and the NN should start up.

          Perhaps you can instigate this failure again by adding "System.exit(0)" right before where START_LOG_SEGMENT is logged in startLogSegmentAndWriteHeaderTxn. That would allow you to see what the right recovery steps are.

          The issue seems to be described in HDFS-2093... I think the following comment may be relevant:

          Thus in the situation above, where the only log we have is this corrupted one, it will refuse to let the NN start, with a nice message explaining that the logs starting at this txid are corrupt with no txns. The operator can then double-check whether a different storage drive which possibly went missing might have better logs, etc, before starting NN.

          Looking at your logs, it seems like you have only one edits directory. So the above probably applies, and you could successfully start by removing that last (empty) log segment.

          The larger concern should be for data loss. Based on what happened in this case it appears that any pending txids would be lost, unless the edit logs could be manually repaired. The filesystem would be intact, only minus the changes from the outstanding edit events, does that sound correct?

          Only "in-flight" transactions could be lost – ie those that were never ACKed to a client. Anything that has been ACKed would have been fsynced to the log, and thus not lost. So, after inspecting the segment to make sure there are truly no transactions, you should be able to remove it and start with no data loss or corruption whatsoever.

          Show
          Todd Lipcon added a comment - Hey Patrick. I think this behavior might have been fixed in 2.0.0 already – the empty file should get properly ignored and the NN should start up. Perhaps you can instigate this failure again by adding "System.exit(0)" right before where START_LOG_SEGMENT is logged in startLogSegmentAndWriteHeaderTxn . That would allow you to see what the right recovery steps are. The issue seems to be described in HDFS-2093 ... I think the following comment may be relevant: Thus in the situation above, where the only log we have is this corrupted one, it will refuse to let the NN start, with a nice message explaining that the logs starting at this txid are corrupt with no txns. The operator can then double-check whether a different storage drive which possibly went missing might have better logs, etc, before starting NN. Looking at your logs, it seems like you have only one edits directory. So the above probably applies, and you could successfully start by removing that last (empty) log segment. The larger concern should be for data loss. Based on what happened in this case it appears that any pending txids would be lost, unless the edit logs could be manually repaired. The filesystem would be intact, only minus the changes from the outstanding edit events, does that sound correct? Only "in-flight" transactions could be lost – ie those that were never ACKed to a client. Anything that has been ACKed would have been fsynced to the log, and thus not lost. So, after inspecting the segment to make sure there are truly no transactions, you should be able to remove it and start with no data loss or corruption whatsoever.
          Hide
          patrick white added a comment -

          Thanks very much Todd, appreciate the feedback and references, and the suggestion on using exit to try to reproduce this.

          Show
          patrick white added a comment - Thanks very much Todd, appreciate the feedback and references, and the suggestion on using exit to try to reproduce this.
          Hide
          Bach Bui added a comment -

          I reproduced this case by simulating the described NN shutdown situation with an exit(0) right after jas.startLogSegment(segmentTxId) in org.apache.hadoop.hdfs.server.namenode.FSEditLog.startLogSegment(long, boolean)

          This action in effect created an edit_inprogress file that has no transaction in it. NN will now fail to restart, because the error handling code can not handle this case.

          An easy work around is to delete the edit_inprogress file. As Todd mentioned, there will be no loss in data when we do this, am I right Todd?

          Ultimately, we need to fix the error handling code in org.apache.hadoop.hdfs.server.namenode.FSImageTransactionalStorageInspector.LogGroup.planAllInProgressRecovery() so that it can detect this situation. It does not seem to be very complicated as this is only a conner case. Please correct me if I am wrong.

          Could someone also tell me how the NN is shutdown? It seems to me this situation only occur if the NN threads are killed without waiting for them to cleanup themselves.

          Show
          Bach Bui added a comment - I reproduced this case by simulating the described NN shutdown situation with an exit(0) right after jas.startLogSegment(segmentTxId) in org.apache.hadoop.hdfs.server.namenode.FSEditLog.startLogSegment(long, boolean) This action in effect created an edit_inprogress file that has no transaction in it. NN will now fail to restart, because the error handling code can not handle this case. An easy work around is to delete the edit_inprogress file. As Todd mentioned, there will be no loss in data when we do this, am I right Todd? Ultimately, we need to fix the error handling code in org.apache.hadoop.hdfs.server.namenode.FSImageTransactionalStorageInspector.LogGroup.planAllInProgressRecovery() so that it can detect this situation. It does not seem to be very complicated as this is only a conner case. Please correct me if I am wrong. Could someone also tell me how the NN is shutdown? It seems to me this situation only occur if the NN threads are killed without waiting for them to cleanup themselves.
          Hide
          Tsz Wo Nicholas Sze added a comment -

          If there is no transaction in edit_inprogress, it can be safely deleted. However, how to tell if it does not have transactions if the file is not empty?

          Show
          Tsz Wo Nicholas Sze added a comment - If there is no transaction in edit_inprogress, it can be safely deleted. However, how to tell if it does not have transactions if the file is not empty?
          Hide
          Todd Lipcon added a comment -

          However, how to tell if it does not have transactions if the file is not empty?

          What do you mean by this? If it's not empty but has a valid header and an OP_INVALID, then it is the same as bytewise empty, and can be deleted. If it has a partial transaction in it, it should be handled the same way that any partial transaction is handled at the end of a file.

          Show
          Todd Lipcon added a comment - However, how to tell if it does not have transactions if the file is not empty? What do you mean by this? If it's not empty but has a valid header and an OP_INVALID, then it is the same as bytewise empty, and can be deleted. If it has a partial transaction in it, it should be handled the same way that any partial transaction is handled at the end of a file.
          Hide
          Tsz Wo Nicholas Sze added a comment -

          > However, how to tell if it does not have transactions if the file is not empty?

          What do you mean by this? ...

          It is a typo. It should read as ".. but the files is not empty".

          ... If it's not empty but has a valid header and an OP_INVALID, ...

          Do you mean that the operator need to use a hex editor to check the editlog?

          Show
          Tsz Wo Nicholas Sze added a comment - > However, how to tell if it does not have transactions if the file is not empty? What do you mean by this? ... It is a typo. It should read as ".. but the files is not empty". ... If it's not empty but has a valid header and an OP_INVALID, ... Do you mean that the operator need to use a hex editor to check the editlog?
          Hide
          Todd Lipcon added a comment -

          Do you mean that the operator need to use a hex editor to check the editlog?

          No, this should happen automatically with the current trunk code - it would validate as an empty log (no transactions) and thus would be moved aside.

          Show
          Todd Lipcon added a comment - Do you mean that the operator need to use a hex editor to check the editlog? No, this should happen automatically with the current trunk code - it would validate as an empty log (no transactions) and thus would be moved aside.
          Hide
          Tsz Wo Nicholas Sze added a comment -

          > No, this should happen automatically with the current trunk code ...

          Do you mean HDFS-2093? HDFS-2093 is also in 0.23. It seems that HDFS-2093 does not solve the problem.

          Show
          Tsz Wo Nicholas Sze added a comment - > No, this should happen automatically with the current trunk code ... Do you mean HDFS-2093 ? HDFS-2093 is also in 0.23. It seems that HDFS-2093 does not solve the problem.
          Hide
          Todd Lipcon added a comment -

          HDFS-2824 is what fixed this in 2.x

          Show
          Todd Lipcon added a comment - HDFS-2824 is what fixed this in 2.x
          Tsz Wo Nicholas Sze made changes -
          Link This issue relates to HDFS-2824 [ HDFS-2824 ]
          Hide
          Tsz Wo Nicholas Sze added a comment -

          It does look like that HDFS-2824 should fix this. Thanks.

          Show
          Tsz Wo Nicholas Sze added a comment - It does look like that HDFS-2824 should fix this. Thanks.
          Hide
          Suresh Srinivas added a comment -

          Is this bug still needed. Can this be closed?

          Show
          Suresh Srinivas added a comment - Is this bug still needed. Can this be closed?
          Hide
          Todd Lipcon added a comment -

          This isn't needed in 2.x - perhaps the 0.23.x maintainers want to keep this open for 0.23.x? Otherwise feel free to close. (I removed the 2.x affects version)

          Show
          Todd Lipcon added a comment - This isn't needed in 2.x - perhaps the 0.23.x maintainers want to keep this open for 0.23.x? Otherwise feel free to close. (I removed the 2.x affects version)
          Todd Lipcon made changes -
          Affects Version/s 2.0.0-alpha [ 12320353 ]
          Hide
          Kihwal Lee added a comment -

          I'm resolving this as WONTFIX. It is not an issue in 2.0 and there is a workaround for 0.23 if this rare condition occurs.

          Show
          Kihwal Lee added a comment - I'm resolving this as WONTFIX. It is not an issue in 2.0 and there is a workaround for 0.23 if this rare condition occurs.
          Kihwal Lee made changes -
          Status Open [ 1 ] Resolved [ 5 ]
          Resolution Won't Fix [ 2 ]

            People

            • Assignee:
              Unassigned
              Reporter:
              patrick white
            • Votes:
              0 Vote for this issue
              Watchers:
              9 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development