
Details

    • Sub-task
    • Status: Resolved
    • Minor
    • Resolution: Fixed
    • QuorumJournalManager (HDFS-3077)
    • ha
    • None

    Description

      During fault testing of QJM, I saw the following issue:

      1) NN sends txn 5 to JN
      2) NN gets partitioned from JN while JN remains up. The next two RPCs are missed during the partition:
      2a) finalizeSegment(1-5)
      2b) startSegment(6)
      3) NN sends txn 6 to JN

      This left one of the JNs with a single segment 1-10, while the others had two segments: 1-5 and 6-10. This broke some invariants of the QJM protocol and prevented the recovery protocol from running properly.

      This can be addressed on the client side by HDFS-3726, which would cause the NN to not send the RPC in step 3. But it also makes sense to add an extra safety check on the server side: with every journal() call, the NN can send the txid of the segment it believes it is writing to. Then, if the JN and the client get "out of sync", the JN can reject the RPC.
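
      The proposed server-side check could look roughly like the sketch below. This is an illustration only, not the actual HDFS code: the class JournalSketch, the exception OutOfSyncException, and the parameter names (segmentTxId, firstTxnId, numTxns) are hypothetical, and the sketch assumes a segment is identified by the txid of its first transaction.

```java
// Hypothetical, simplified model of a JournalNode. Not the real HDFS classes.
class JournalSketch {
    /** Thrown when the writer's view of the open segment differs from the JN's. */
    static class OutOfSyncException extends RuntimeException {
        OutOfSyncException(String msg) { super(msg); }
    }

    private static final long NO_SEGMENT = -1;

    // Assumption: a segment is identified by the txid of its first transaction.
    private long curSegmentTxId = NO_SEGMENT;
    private long nextTxId = 1;

    void startSegment(long txid) {
        if (curSegmentTxId != NO_SEGMENT) {
            throw new IllegalStateException("segment " + curSegmentTxId + " is still open");
        }
        curSegmentTxId = txid;
        nextTxId = txid;
    }

    void finalizeSegment(long firstTxId, long lastTxId) {
        if (firstTxId != curSegmentTxId || lastTxId != nextTxId - 1) {
            throw new IllegalStateException("finalize does not match the open segment");
        }
        curSegmentTxId = NO_SEGMENT;
    }

    /** Appends numTxns transactions, but only if the writer names the segment we have open. */
    void journal(long segmentTxId, long firstTxnId, int numTxns) {
        if (segmentTxId != curSegmentTxId) {
            throw new OutOfSyncException("writer is in segment " + segmentTxId
                + " but the open segment here is " + curSegmentTxId);
        }
        if (firstTxnId != nextTxId) {
            throw new OutOfSyncException("expected txid " + nextTxId + ", got " + firstTxnId);
        }
        nextTxId += numTxns;
    }

    long lastWrittenTxId() { return nextTxId - 1; }

    public static void main(String[] args) {
        JournalSketch jn = new JournalSketch();
        jn.startSegment(1);
        jn.journal(1, 1, 5);          // step 1: txns 1-5 arrive in segment 1
        // step 2: finalizeSegment(1, 5) and startSegment(6) never reach this JN
        try {
            jn.journal(6, 6, 1);      // step 3: writer believes it is in segment 6
        } catch (OutOfSyncException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```

      In this model, the write in step 3 is rejected because the JN still has segment 1 open, so that segment ends at txid 5 instead of silently growing past it.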


          People

            tlipcon Todd Lipcon
            tlipcon Todd Lipcon
            Votes: 0
            Watchers: 3

