
Details

    • Type: Sub-task
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • QuorumJournalManager (HDFS-3077)
    • ha
    • None

    Description

      Per one of the TODOs in Journal.java, there is currently a lack of atomicity in the acceptRecovery() code path. In particular, we have the following actions executed non-atomically:

      • Download a new edits_inprogress_N from some other node.
      • Persist the Paxos recovery file to disk.

      If the JN crashes between these two steps, it may be left in a state in which the edits_inprogress file contains different data than the Paxos data left on disk by a previous recovery attempt. This causes the next prepareRecovery() to fail with an AssertionError.
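
One common way to close this kind of window is to stage the download under a temporary name, treat the Paxos record as the commit point, and roll the staged file forward on restart. The following is a minimal sketch under that assumption, not the actual Journal.java code: the class name, the file names (`paxos_N`, the `.epoch=E.tmp` suffix), the `crashAfterStep` parameter, and the one-number Paxos record are all hypothetical, and a real implementation would also fsync files and directories.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

/** Hypothetical sketch of a crash-safe acceptRecovery(); not the real Journal.java. */
final class AcceptRecoverySketch {
    private final Path dir;

    AcceptRecoverySketch(Path dir) {
        this.dir = dir;
    }

    Path segment(long txid) {
        return dir.resolve("edits_inprogress_" + txid);
    }

    // Staged download, tagged with the epoch of the recovery attempt (assumed naming).
    Path stagedSegment(long txid, long epoch) {
        return dir.resolve("edits_inprogress_" + txid + ".epoch=" + epoch + ".tmp");
    }

    Path paxosFile(long txid) {
        return dir.resolve("paxos_" + txid);
    }

    /** crashAfterStep: 0 = no fault, 1 or 2 = simulate a crash after that step. */
    void acceptRecovery(long txid, long epoch, byte[] downloaded, int crashAfterStep)
            throws IOException {
        // Step 1: stage the downloaded segment; the live segment is untouched.
        Files.write(stagedSegment(txid, epoch), downloaded);
        if (crashAfterStep == 1) {
            throw new IOException("simulated crash after step 1");
        }

        // Step 2: persist the Paxos record via write-then-rename. This rename
        // is the commit point. (Replacing an existing target relies on
        // rename semantics, as on POSIX file systems.)
        Path tmp = dir.resolve("paxos_" + txid + ".tmp");
        Files.write(tmp, Long.toString(epoch).getBytes(StandardCharsets.UTF_8));
        Files.move(tmp, paxosFile(txid), StandardCopyOption.ATOMIC_MOVE);
        if (crashAfterStep == 2) {
            throw new IOException("simulated crash after step 2");
        }

        // Step 3: move the staged segment into place.
        Files.move(stagedSegment(txid, epoch), segment(txid), StandardCopyOption.ATOMIC_MOVE);
    }

    /** On restart: finish a committed recovery, discard uncommitted downloads. */
    void recoverOnRestart(long txid) throws IOException {
        long acceptedEpoch = -1;
        if (Files.exists(paxosFile(txid))) {
            String text = new String(Files.readAllBytes(paxosFile(txid)), StandardCharsets.UTF_8);
            acceptedEpoch = Long.parseLong(text.trim());
        }
        Path committed = stagedSegment(txid, acceptedEpoch).getFileName();
        String glob = "edits_inprogress_" + txid + ".epoch=*.tmp";
        try (DirectoryStream<Path> staged = Files.newDirectoryStream(dir, glob)) {
            for (Path p : staged) {
                if (p.getFileName().equals(committed)) {
                    // The Paxos record names this epoch: roll forward.
                    Files.move(p, segment(txid), StandardCopyOption.ATOMIC_MOVE);
                } else {
                    // The Paxos record was never written for this download.
                    Files.delete(p);
                }
            }
        }
    }
}
```

With this ordering, a crash at any point leaves either the old segment with the old Paxos record, or a new Paxos record whose matching staged file `recoverOnRestart()` can move into place, so the two never disagree once restart handling has run.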

      I discovered this by randomly injecting a fault between the two steps, and then running the randomized fault test on a cluster. This resulted in some AssertionErrors in the test logs.
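
As an illustration of the kind of fault point described above, the sketch below shows a seeded injector that throws with a fixed probability. It is a hypothetical helper written for this example, not the injection mechanism used in the HDFS test suite; the class name, method names, and the choice of IOException are assumptions.

```java
import java.io.IOException;
import java.util.Random;

/** Hypothetical seeded fault injector; not part of the HDFS code base. */
final class RandomFaultInjector {
    private final Random random;
    private final double probability;
    private int injected = 0;

    RandomFaultInjector(long seed, double probability) {
        // A fixed seed makes a failing run reproducible.
        this.random = new Random(seed);
        this.probability = probability;
    }

    /** Throws with the configured probability; call between the two steps. */
    void maybeFail(String point) throws IOException {
        if (random.nextDouble() < probability) {
            injected++;
            throw new IOException("Injected fault at " + point);
        }
    }

    int injectedCount() {
        return injected;
    }
}
```

A call such as `injector.maybeFail("between download and persist")` placed between the two steps would exercise the crash window, and logging the seed lets a failing run be replayed.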


          People

            Assignee: Todd Lipcon (tlipcon)
            Reporter: Todd Lipcon (tlipcon)
            Votes: 0
            Watchers: 3
