Details

    • Type: Sub-task
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: Shared journals (HDFS-3092)
    • Fix Version/s: None
    • Component/s: ha, namenode
    • Labels: None

      Description

      The epoch received over JournalProtocol should be persisted by JournalService.

        Activity

        Suresh Srinivas added a comment -

        There is some discussion in HDFS-3077 about this. Currently two alternatives under consideration are:

        1. Use the record we write when starting a log segment to record the epoch.
          • On a fence call, a JournalService promises not to accept any other requests from the old active.
          • After the fence, the next call is roll, which creates a new log segment. JournalService records the epoch in this record.
          • This fits nicely with the idea that every log segment belongs to a single epoch.
        2. Use a separate metadata file to record the epoch.

        Based on the discussions in HDFS-3077, let's choose one of the options.

        Todd Lipcon added a comment -

        I don't think it's reasonable to put the epoch number inside the START transaction, because that leaks the idea of epochs out of the journal manager layer into the NN layer.

        Also, if the JN restarts, when it comes up, how do you make sure that an old NN doesn't come back to life with a startLogSegment transaction?

        I think you need to record the epoch number separately from the idea of segments, for fencing purposes, since you aren't always guaranteed to be in the middle of a segment, and you don't want disagreement about who gets to call startLogSegment.
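
        A minimal sketch of what recording the epoch separately from segments could look like. The class name `PersistedEpoch`, the file name `last-promised-epoch`, and the method names are all hypothetical and do not come from the HDFS code; the sketch only illustrates that a restarted JN can reload the promised epoch and still reject an old NN. A real implementation would presumably also fsync the file and its directory.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

// Hypothetical helper, not part of HDFS: keeps the last promised epoch in its
// own file so that the promise survives a JournalNode restart.
class PersistedEpoch {
  private final Path file;
  private long promised;

  PersistedEpoch(Path dir) {
    this.file = dir.resolve("last-promised-epoch"); // assumed file name
    try {
      // Reload the promise on startup; 0 means "no writer has fenced yet".
      this.promised = Files.exists(file)
          ? Long.parseLong(
              new String(Files.readAllBytes(file), StandardCharsets.UTF_8).trim())
          : 0L;
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  // Fence: accept only a strictly newer epoch and write it down before returning.
  synchronized void fence(long epoch) {
    if (epoch <= promised) {
      throw new IllegalStateException("epoch " + epoch + " <= promised " + promised);
    }
    try {
      // Write to a temporary file, then rename it over the old one.
      Path tmp = file.resolveSibling(file.getFileName() + ".tmp");
      Files.write(tmp, Long.toString(epoch).getBytes(StandardCharsets.UTF_8));
      Files.move(tmp, file,
          StandardCopyOption.REPLACE_EXISTING, StandardCopyOption.ATOMIC_MOVE);
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
    promised = epoch;
  }

  // Every write-path request, including startLogSegment, is checked here.
  synchronized void checkWriter(long epoch) {
    if (epoch < promised) {
      throw new IllegalStateException(
          "writer epoch " + epoch + " is fenced by epoch " + promised);
    }
  }

  synchronized long promised() {
    return promised;
  }
}
```

        Constructing a second `PersistedEpoch` over the same directory stands in for a JN restart: the new instance starts with the epoch written by the last successful fence, whether or not a segment was open at the time.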

        Bikas Saha added a comment -

        I have been trying to read ZAB and re-read Paxos before I comment on some of the epoch stuff.
        At first glance, it seems to me that some of these operations need to be atomic. I haven't caught up with HDFS-3077, but I remember Todd responding to an example of mine by saying that edit log segments are relevant in the context of an epoch. So two edit logs with the same txid can be differentiated using epochs. In that case, it makes sense to tie the epoch-to-segment relation to the roll, via option 1 above, because then creating a segment and attaching it to an epoch would be one operation, to the extent that rolling is one operation.
        Option 2 might be less optimal because it now consists of two operations: 1) rolling the log and creating a new segment, and 2) updating a metadata file.
        However, my understanding of rolling might be incomplete, so please take this with the necessary pinch of salt.

        Tsz Wo Nicholas Sze added a comment -

        > Also, if the JN restarts, when it comes up, how do you make sure that an old NN doesn't come back to life with a startLogSegment transaction?

        Is it the case that JN will reject it since the old NN has a smaller epoch?

        Todd Lipcon added a comment -

        > Is it the case that JN will reject it since the old NN has a smaller epoch?

        Right – that's why it needs to persist, IMO.

        > Option 2 might be less optimal because it now consists of two operations: 1) rolling the log and creating a new segment, and 2) updating a metadata file.

        I think it's just a matter of getting the ordering right. Before starting a log segment, you need to fence prior writers. The fencing step is what writes down the epoch. Then, when you create a new log segment, you tag it (e.g., by storing it in a directory per epoch, or by writing a metadata file next to it before you create the file). I think this is sufficiently atomic.
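
        The ordering described above can be sketched roughly as follows, using the directory-per-epoch variant. The class `EpochTaggedJournal`, the `epoch-<n>` directory layout, and the `edits_inprogress_<txid>` file name are assumptions made for illustration, not the actual JournalNode layout, and the promised epoch is held in memory here only to keep the sketch short.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical sketch, not HDFS code: fence first, then create the segment
// inside a directory named after the epoch.
class EpochTaggedJournal {
  private final Path root;
  private long promisedEpoch = 0; // in memory for brevity; would be persisted

  EpochTaggedJournal(Path root) {
    this.root = root;
  }

  // Step 1: fencing records the new epoch and shuts out older writers.
  synchronized void fence(long epoch) {
    if (epoch <= promisedEpoch) {
      throw new IllegalStateException("stale epoch " + epoch);
    }
    promisedEpoch = epoch;
  }

  // Step 2: only the current epoch may start a segment; the segment is
  // tagged by the directory it lives in.
  synchronized Path startLogSegment(long epoch, long firstTxId) {
    if (epoch != promisedEpoch) {
      throw new IllegalStateException("epoch " + epoch + " is not the current writer");
    }
    try {
      Path dir = Files.createDirectories(root.resolve("epoch-" + epoch)); // assumed layout
      return Files.createFile(dir.resolve("edits_inprogress_" + firstTxId)); // assumed name
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }
}
```

        With this layout, two segments that start at the same txid but were written under different epochs end up in different directories, so they stay distinguishable without changing the contents of the segment file.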

        > So two edit logs with the same txid can be differentiated using epochs.

        I've had another idea which I want to write up in the design doc. But, basically, I think we can solve this problem more simply by the following:

        • Currently, when FSEditLog starts a new segment, it calls journal.startLogSegment(), then journal.logEdit(StartLogSegmentOp), then journal.logSync(). So there is a point in time when the log segment is empty, with no transactions. If, instead, we changed it so that the startLogSegment() call was responsible for writing the first transaction (and only the first), atomically, then we might not have a problem. We just have to make the restriction that the first transaction of any segment is always deterministic (e.g., just START_LOG_SEGMENT(txid) and nothing else).

        Let me revise the design doc in HDFS-3077 with this idea to see if it works when fully fleshed out.
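
        A rough sketch of the "startLogSegment writes the first transaction atomically" idea, assuming a write-to-temporary-file-then-rename approach. The class `AtomicSegmentStarter`, the opcode value, the record encoding, and the file names are invented for illustration and do not match the real edit log format.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.nio.file.StandardOpenOption;

// Hypothetical sketch, not HDFS code: the segment only becomes visible once it
// already contains its deterministic first record.
class AtomicSegmentStarter {
  static final byte OP_START = 1; // made-up opcode value

  // Deterministic first record: opcode plus txid, and nothing else.
  static byte[] startRecord(long txId) {
    return ByteBuffer.allocate(1 + Long.BYTES).put(OP_START).putLong(txId).array();
  }

  static Path startLogSegment(Path dir, long txId) {
    Path tmp = dir.resolve("edits_inprogress_" + txId + ".tmp"); // assumed names
    Path dst = dir.resolve("edits_inprogress_" + txId);
    try (FileChannel ch = FileChannel.open(tmp,
        StandardOpenOption.CREATE,
        StandardOpenOption.TRUNCATE_EXISTING,
        StandardOpenOption.WRITE)) {
      ByteBuffer buf = ByteBuffer.wrap(startRecord(txId));
      while (buf.hasRemaining()) {
        ch.write(buf);
      }
      ch.force(true); // flush the record to disk before publishing the file
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
    try {
      // Publish the segment in one step; readers never see it empty.
      return Files.move(tmp, dst, StandardCopyOption.ATOMIC_MOVE);
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }
}
```

        Because the first record depends only on the txid, two writers that start a segment at the same txid produce byte-identical files up to that point, which is the property the restriction above is after.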

        Suresh Srinivas added a comment -

        > I don't think it's reasonable to put the epoch number inside the START transaction, because that leaks the idea of epochs out of the journal manager layer into the NN layer.

        I do not understand what you mean by NN layer. Epoch is a notion shared between the JournalManager and the JournalNode. Both need to understand it and provide appropriate guarantees.

        > Also, if the JN restarts, when it comes up, how do you make sure that an old NN doesn't come back to life with a startLogSegment transaction?

        Can you give me an example? I am not sure I understand the issue.

        Suresh Srinivas added a comment -

        > Currently, when FSEditLog starts a new segment, it calls journal.startLogSegment(), then journal.logEdit(StartLogSegmentOp), then journal.logSync(). So there is a point in time when the log segment is empty, with no transactions. If, instead, we changed it so that the startLogSegment() call was responsible for writing the first transaction (and only the first), atomically, then we might not have a problem. We just have to make the restriction that the first transaction of any segment is always deterministic (e.g., just START_LOG_SEGMENT(txid) and nothing else).

        Okay, I am surprised to find this. All along, in previous discussions, I have been assuming that JournalManager calls roll on JournalService and that the startLog transaction is recorded in JournalService. That is when the epoch also gets persisted, along with that record.

        > I think it's just a matter of getting the ordering right. Before starting a log segment, you need to fence prior writers. The fencing step is what writes down the epoch. Then, when you create a new log segment, you tag it (e.g., by storing it in a directory per epoch, or by writing a metadata file next to it before you create the file). I think this is sufficiently atomic.

        Whether you store it in a directory per epoch or record it in the startLogSegment record at the beginning of the segment, the two are essentially the same.

        Todd Lipcon added a comment -

        > I do not understand what you mean by NN layer. Epoch is a notion from JournalManager to the JournalNode. Both need to understand this and provide appropriate guarantees.

        Currently, the NN code when starting a new log segment looks like this:

            editLogStream = journalSet.startLogSegment(segmentTxId);
            ...
            if (writeHeaderTxn) {
              logEdit(LogSegmentOp.getInstance(
                  FSEditLogOpCodes.OP_START_LOG_SEGMENT));
              logSync();
            }

        So starting a segment and writing the OP_START_LOG_SEGMENT transaction are separate operations. In general, the JournalManager abstraction doesn't know about the contents of the edits it's writing; it's just responsible for bytes. If you wanted to include the epoch number in the OP_START_LOG_SEGMENT transaction, you'd have to have the NN code do something like journalManager.getCurrentEpoch(), and then feed that into the logEdit call. But that's not very generic, so it seems like a leak of abstraction.

        > Whether you store it in a directory per epoch or record it in the startLogSegment record at the beginning of the segment, the two are essentially the same.

        I agree, if you're talking about prefixing it at the beginning of the file, before the first transaction. But, if you're talking about actually putting it in the content of the first transaction, I think it's a bad idea for the reason above. My preference is to keep it separated from the file, so that the files written by JournalDaemon are exactly identical to the files that would be written by FileJournalManager. That allows you to copy to and from the different types of nodes without any difference in format.

        Hari Mankude added a comment -

        > I agree, if you're talking about prefixing it at the beginning of the file, before the first transaction. But, if you're talking about actually putting it in the content of the first transaction, I think it's a bad idea for the reason above.

        Todd, if you are referring to creating an edit log with the name format edit_log_<epoch_num>_in_progress or, when finalized, edit_log_<epoch_number>_<start_txid>_<end_txid>, that is a better solution than creating a separate metadata file. Otherwise, Suresh's solution of adding the epoch number to the start log segment sounds good. Actually, for debugging purposes, we should add more information, such as the time when the journal was started, the NN id of the owner, etc., along with the epoch number. Basically, convert OP_START_LOG_SEGMENT to hold journal header info.
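
        For illustration, the naming scheme suggested above might be built and parsed along these lines. The helper `EpochSegmentName` is hypothetical, and the underscore separators between the fields are an assumption about the intended format.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical naming helper, not HDFS code. The exact format (underscore
// separators, decimal numbers) is assumed from the comment above.
class EpochSegmentName {
  private static final Pattern FINALIZED =
      Pattern.compile("edit_log_(\\d+)_(\\d+)_(\\d+)");

  static String inProgress(long epoch) {
    return "edit_log_" + epoch + "_in_progress";
  }

  static String finalized(long epoch, long startTxId, long endTxId) {
    return "edit_log_" + epoch + "_" + startTxId + "_" + endTxId;
  }

  // Returns {epoch, startTxId, endTxId}, or null if the name is not a
  // finalized segment name in the assumed format.
  static long[] parseFinalized(String name) {
    Matcher m = FINALIZED.matcher(name);
    if (!m.matches()) {
      return null;
    }
    return new long[] {
        Long.parseLong(m.group(1)),
        Long.parseLong(m.group(2)),
        Long.parseLong(m.group(3))
    };
  }
}
```

        Putting the epoch in the name keeps the file contents unchanged, but any code that lists or matches segment files by name has to learn the new pattern.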

        Todd Lipcon added a comment -

        > Todd, if you are referring to creating an edit log with the name format edit_log_<epoch_num>_in_progress or, when finalized, edit_log_<epoch_number>_<start_txid>_<end_txid>, that is a better solution than creating a separate metadata file.

        Sure, that works too. Except you'll have to change a ton of FileJournalManager code paths to do this...

        > Otherwise, Suresh's solution of adding the epoch number to the start log segment sounds good.

        I still think that's really wrong, because transaction data is separate from transaction storage. Epoch numbers are a storage layer thing.

        > Actually, for debugging purposes, we should add more information, such as the time when the journal was started, the NN id of the owner, etc., along with the epoch number.

        I agree with all of the above, except for the epoch number. The timestamp, NN id, hostname, etc, are all NN-layer things, whereas the epoch number is an edits storage layer thing.


          People

          • Assignee: Unassigned
          • Reporter: Suresh Srinivas
          • Votes: 0
          • Watchers: 3
