Hadoop HDFS / HDFS-1382

A transient failure with the edits log and a corrupted fstime together could lead to data loss


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Component: namenode

    Description

      We experienced a data loss situation caused by two failures together:
      a transient disk failure affecting the edits log, and a corrupted fstime.

      Here are the details:

      1. The NameNode has 2 edits directories (say edit0 and edit1).

      2. During an update to edit0, there is a transient disk failure,
      making the NameNode bump the fstime, mark edit0 as stale,
      and continue working with edit1.

      3. The NameNode is shut down. Unluckily, the fstime in edit0
      is now corrupted. Hence during NameNode startup, the stale log in edit0
      is replayed, and the updates recorded only in edit1 are lost.
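
The sequence above can be sketched as a toy simulation. This is not NameNode code: `EditsDir`, `log_edit`, and `pick_on_startup` are hypothetical names, and the model assumes that (a) startup trusts the directory with the largest fstime, (b) the stale flag is lost on shutdown, and (c) the corrupted fstime happens to read as a newer value. The report does not state how the corrupted value is actually interpreted.

```python
from dataclasses import dataclass, field


@dataclass
class EditsDir:
    """Toy stand-in for one NameNode edits directory (hypothetical model)."""
    name: str
    fstime: int = 0                        # time stamp recorded in this directory
    edits: list = field(default_factory=list)
    stale: bool = False                    # in-memory flag, assumed lost on shutdown


def log_edit(dirs, op, now, failing=()):
    """Append op to every healthy directory. On a failed write, mark that
    directory stale and bump fstime in the remaining healthy directories."""
    failed = False
    for d in dirs:
        if d.stale:
            continue
        if d.name in failing:
            d.stale = True
            failed = True
        else:
            d.edits.append(op)
    if failed:
        for d in dirs:
            if not d.stale:
                d.fstime = now


def pick_on_startup(dirs):
    """Assumed startup rule: trust the directory with the largest fstime."""
    return max(dirs, key=lambda d: d.fstime)


edit0, edit1 = EditsDir("edit0"), EditsDir("edit1")
dirs = [edit0, edit1]

log_edit(dirs, "mkdir /a", now=1)                         # step 1: both healthy
log_edit(dirs, "create /a/f", now=2, failing={"edit0"})   # step 2: transient failure
log_edit(dirs, "create /a/g", now=3)                      # only edit1 is updated

# Without corruption, the bumped fstime makes startup choose edit1.
chosen_before = pick_on_startup(dirs).name

# Step 3: fstime in edit0 is corrupted (assumed here to read as newer).
edit0.fstime = 2**40
chosen_after = pick_on_startup(dirs)

# Operations present in edit1 but missing from the replayed log.
lost = [op for op in edit1.edits if op not in chosen_after.edits]
```

In this model, a single corrupted fstime is enough to defeat the protection that bumping fstime was meant to provide, because nothing else records that edit0 was marked stale.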

      This bug was found by our Failure Testing Service framework:
      http://www.eecs.berkeley.edu/Pubs/TechRpts/2010/EECS-2010-98.html
      For questions, please email us: Thanh Do (thanhdo@cs.wisc.edu) and
      Haryadi Gunawi (haryadi@eecs.berkeley.edu)

          People

            Assignee: Unassigned
            Reporter: Thanh Do (thanhdo)
            Votes: 0
            Watchers: 7