Details

    • Type: Sub-task
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: HA branch (HDFS-1623)
    • Component/s: ha, namenode
    • Labels: None
    • Hadoop Flags: Reviewed

Description

      To enable a hot standby namenode, the standby node must have current information for both the namenode state (image + edits) and block locations. This jira addresses keeping the namenode state current on the standby node. The solution proposed in this jira is to use shared storage to store the namenode state.

      Note that one could also build an alternative solution by augmenting the backup node. A separate jira could explore this.

Attachments

    1. hdfs-1975.txt
        14 kB
        Todd Lipcon
      2. hdfs-1975.txt
        14 kB
        Todd Lipcon
      3. HDFS-1975-HA.2.patch
        23 kB
        Jitendra Nath Pandey
      4. HDFS-1975-HA.patch
        14 kB
        Jitendra Nath Pandey
      5. HDFS-1975-HDFS-1623.patch
        76 kB
        Todd Lipcon
      6. HDFS-1975-HDFS-1623.patch
        73 kB
        Todd Lipcon
      7. HDFS-1975-HDFS-1623.patch
        82 kB
        Aaron T. Myers
      8. HDFS-1975-HDFS-1623.patch
        82 kB
        Aaron T. Myers

Activity

          Todd Lipcon added a comment -

          Committed to HA branch. Thanks Jitendra and Aaron for the original revs of the patch, and Eli for reviewing.

          Eli Collins added a comment -

          Delta lgtm

          Todd Lipcon added a comment -

          Before committing I ran the rest of the edit-log tests and discovered a couple small issues. This patch has the following fixes:

          • fix TestEditLog to initialize journals for write, fixing the failing test cases
          • fix FSNamesystem to not use /tmp/hadoop-name as a default shared edits directory (fixes a failing TestEditLog case)
          • add shared edits dirs to hdfs-default.xml
          • fix MiniDFSCluster to not fail if running HA tests on a clean build dir (Files.deleteRecursively was failing if the dir didn't exist)
          • remove empty diff in HAUtil
          • rebase on new tip of branch

          Since the above changes are pretty simple, I'll commit under Eli's +1 tomorrow morning.

          Eli Collins added a comment -

          +1 to the latest patch. Nit: you can remove the diff against HAUtil.

          Todd Lipcon added a comment -

          Opened HDFS-2602 and HDFS-2603 for the OP_ADD issue and the replication/invalidation issue. Not opening one to track txid through all the commands yet, since it's not clear it's necessary.

          Eli Collins added a comment -

          Agree. Let's get this one in and file a separate jira for each; there's no reason they need to be closed out first, and this way we can attack them in parallel.

          Todd Lipcon added a comment -

          I chatted with ATM offline about this. In our opinion it makes the most sense to commit this patch pretty much as-is, even though we know there are some issues as described above. Then, we can open separate follow-on JIRAs for each of the distinct issues:

          • one JIRA for dealing with the standby losing block locations because OP_ADD and OP_CLOSE replace the BlockInfos
          • one JIRA to make the standby not manage replication and invalidation queues until it enters active mode
          • one JIRA to figure out if we need to track txids through the NN<->DN messaging instead of just using the genstamps as Jitendra has done in the patch here.

          Having this basic tailing infrastructure committed will let us start to investigate and fix the above issues in parallel and make faster progress.

          Agreed?

          Todd Lipcon added a comment -

          Another issue that we have to tackle before this can provide hot standby is this:

          When we close a file, or add another block to a file, we write OP_CLOSE or OP_ADD in the txn log. FSEditLogLoader, when it sees these types of transactions, creates new BlockInfo objects for all of the blocks listed in the transaction. These new BlockInfos have no block locations associated. So, when we close a file, the SBNN loses its block locations info for that file and is no longer "hot".

          I have an ugly hack which copies over the old BlockInfos from the existing INode, but I'm not convinced it's the right way. It might be cleaner to add new opcode types like OP_ADD_ADDITIONAL_BLOCK, and actually treat OP_CLOSE as just a finalization of INodeFileUnderConstruction to INodeFile, rather than replacing block info at all.
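
          For illustration, a minimal self-contained sketch of the copy-over idea (the types below are simplified stand-ins, not the real FSEditLogLoader/BlockInfo classes):

            import java.util.Map;

            class BlockInfoCopySketch {
              static class BlockInfo {
                final long blockId;
                String[] locations;   // DN locations learned from block reports
                BlockInfo(long id) { this.blockId = id; }
              }

              // Models replaying a block from OP_ADD/OP_CLOSE on the standby: reuse
              // the locations of the BlockInfo already in the blocks map so the
              // standby stays "hot".
              static BlockInfo replayBlock(Map<Long, BlockInfo> blocksMap, long blockId) {
                BlockInfo existing = blocksMap.get(blockId);
                BlockInfo replayed = new BlockInfo(blockId);
                if (existing != null) {
                  replayed.locations = existing.locations;  // the copy-over "hack"
                }
                blocksMap.put(blockId, replayed);
                return replayed;
              }
            }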

          Thoughts?

          Todd Lipcon added a comment -

          Another possible answer is that we need to modify FSNamesystem.isPopulatingReplQueues to return false on the standby, and then, when it switches from standby to active, initialize the replication queues only after reading the latest edits. I think that will solve the SET_REPLICATION issue, but I'm not certain it will solve all the issues in this general class.
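
          A minimal sketch of that gating idea (the names loosely mirror FSNamesystem but are an assumption about shape, not committed code):

            class ReplQueueGateSketch {
              enum HAState { ACTIVE, STANDBY }

              private HAState haState = HAState.STANDBY;
              private boolean replQueuesInitialized = false;

              boolean isPopulatingReplQueues() {
                // The standby never manages replication/invalidation queues.
                return haState == HAState.ACTIVE && replQueuesInitialized;
              }

              void transitionToActive() {
                haState = HAState.ACTIVE;
                catchUpOnEditLog();            // read the latest edits first...
                replQueuesInitialized = true;  // ...then start managing replication
              }

              private void catchUpOnEditLog() { /* tail the remaining edits (omitted) */ }
            }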

          Todd Lipcon added a comment -

          Attached patch rebased on top of latest HDFS-2582, HDFS-2591, HDFS-1971, in that order.

          Todd Lipcon added a comment -

          I got this patch in conjunction with HDFS-1108 and HDFS-1971 to properly replicate the creation of a new file, but then moved on to working on setReplication and ran into issues there. The issue I'm seeing is this:

          1) Active NN receives setReplication to drop some file's replication from 3 to 1
          2) It writes OP_SET_REPLICATION to its log, invalidates two replicas, and returns
          3) The DNs report BLOCK_INVALIDATED back to both the ActiveNN and SBNN.
          4) The SBNN hasn't received the OP_SET_REPLICATION yet, so it marks the block as under-replicated.

          In the case of raising replication (e.g. from 1 to 3) we get the opposite problem: the SBNN marks the block as over-replicated and adds two of the replicas to its invalidation list.

          Generation stamps don't help here, because changing the replication level of a block doesn't change its gen-stamp (and it shouldn't).

          The solution I'm thinking of is that we have to track the transaction ID when we send commands to DNs. So, if a setReplication command at txid=123 causes invalidation of two blocks, we'd send the INVALIDATE command with "txid=123". Then, when the DN does delete these blocks, it would ack back with that txid to both NNs. The SBNN wouldn't process this message until it had loaded that txid.

          A bit of a simplification from this would be that any command being processed from an NN will include the NN's txid, which the DN records in BPOfferService as "latestCommandTxId". Then, any calls to the NN would include this txid. This is a bit more conservative than tracking it with each block command, but probably less prone to bugs.

          I plan to take a pass at implementing this latter approach.
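
          A rough sketch of this latter, conservative variant (field and method names like latestCommandTxId are illustrative, not actual protocol changes):

            class TxidTrackingSketch {
              // Models the DN side (BPOfferService): remember the highest txid seen
              // in any NN command and echo it on subsequent reports.
              static class BPOfferServiceModel {
                private long latestCommandTxId = -1;

                void processCommand(long commandTxId /*, DatanodeCommand cmd */) {
                  latestCommandTxId = Math.max(latestCommandTxId, commandTxId);
                  // ... execute the command, e.g. invalidate blocks (omitted) ...
                }

                long txIdForReports() { return latestCommandTxId; }
              }

              // Models the SBNN side: defer a DN report until the edits that
              // caused it have been applied locally.
              static boolean canProcessReport(long reportTxId, long lastAppliedTxId) {
                return reportTxId <= lastAppliedTxId;
              }
            }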

          Todd Lipcon added a comment -

          > How about update the generation stamp after logging the new block?

          The issue is that we need the new block to have the new genstamp, so we have to call nextGenerationStamp before we create the new block. Alternatively, we could split up nextGenerationStamp into two parts – one which increments, and another which logs. But then we may have an issue if there is a crash so the edit log includes the block allocation but not the SET_GENERATION_STAMP.

          I've temporarily worked around this by changing the FSEditLogLoader code to only call notifyGenStampUpdate in OP_ADD, not in OP_SET_GENSTAMP. So, the new block is added to the block manager before the standby is notified that it's OK to handle the DN messages.
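
          A simplified sketch of that workaround (stand-in op handling, not the actual FSEditLogLoader change):

            class GenStampGateSketch {
              enum OpCode { OP_SET_GENSTAMP, OP_ADD }

              private long genStamp;

              void applyOp(OpCode op, long opGenStamp) {
                switch (op) {
                  case OP_SET_GENSTAMP:
                    genStamp = opGenStamp;
                    // Deliberately no notifyGenStampUpdate() here: the new block is
                    // not in the blocks map yet, so DN messages must stay queued.
                    break;
                  case OP_ADD:
                    addBlockToBlocksMap(opGenStamp);
                    notifyGenStampUpdate(opGenStamp);  // safe: the block is now known
                    break;
                }
              }

              private void addBlockToBlocksMap(long gs) { /* omitted */ }
              private void notifyGenStampUpdate(long gs) { /* replay queued DN messages */ }
            }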

          BUT – there's another issue with using genstamps as our "gating" mechanism for the DN messages - comment to follow.

          Jitendra Nath Pandey added a comment -

          How about update the generation stamp after logging the new block? The pipeline will be setup with the new generation stamp and block received will also contain the new gen-stamp.

          Todd Lipcon added a comment -

          I'm trying to write a test that combines this with HDFS-1971, HDFS-2591, and HDFS-1108, and ran into an issue with the design:

          With HDFS-1108, we persist blocks, but the order of the edits in the edit log is:

          2: SET_GENERATION_STAMP 1001
          3: ADD /testStandbyIsHot with no blocks
          4: SET_GENERATION_STAMP 1002
          5: ADD /testStandbyIsHot with the block id blk_486254483591558448_1002 length=0
          6: CLOSE /testStandbyIsHot with the block id blk_486254483591558448_1002 length=1000
          

          The issue is that when the StandbyNode is receiving the edits, it gets SET_GENERATION_STAMP=1002 before it gets the mapping of that block into the file. So, it processes the blockReceived, sees it as an invalid block not belonging to any file, and ignores it.

          Will think about a solution....

          Aaron T. Myers added a comment -

          Thanks a ton for your review, Todd. Here's an updated patch which addresses all of your comments except the following.

          > In EditLogFileInputStream, why do we need to pass isInProgress as a boolean? If it's in-progress, that means we don't know the lastTxId, so it would be INVALID_TXID, right? So we can implement isInProgress() by just comparing lastTxId.

          Because in the case of pre-transactional ELFIS, we pass in INVALID_TXID, even though the log is not yet in progress.
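
          To spell out why the flag can't be derived from the txid alone, a toy model of EditLogFileInputStream (names simplified):

            class ElfisSketch {
              static final long INVALID_TXID = -1;

              private final long lastTxId;
              private final boolean inProgress;   // must be passed explicitly

              ElfisSketch(long lastTxId, boolean inProgress) {
                this.lastTxId = lastTxId;
                this.inProgress = inProgress;
              }

              boolean isInProgress() {
                // return lastTxId == INVALID_TXID;  // wrong: a pre-transactional
                // (finalized) log also has lastTxId == INVALID_TXID
                return inProgress;
              }
            }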

          I'm still working on producing an updated patch which also adds testing for the pending DN message queues.

          Todd Lipcon added a comment -
          • Why is getMaxGsInBlockList static? seems it could just be a member function of BlockListAsLongs
          • Storage has a javadoc @param shouldLock but the parameter doesn't seem to be in the method signature

          In EditLogFileInputStream, why do we need to pass isInProgress as a boolean? If it's in-progress, that means we don't know the lastTxId, so it would be INVALID_TXID, right? So we can implement isInProgress() by just comparing lastTxId.


          FSEditLog.java:

          • There's a typo "UNITIALIZED" instead of "UNINITIALIZED" in the javadoc in FSEditLog
          • The comment before sharedEditsDirs in FSEditLog should be javadoc-style
          • In the FSEditLog state machine, how does the transition from OPEN_FOR_READING work when going active? The javadoc could use a little bit more there (do we go first to CLOSED?)
          • Similar to above - would be good to add Preconditions.checkState checks in initJournalsForWrite and initSharedJournalsForRead - it's not obvious what state they should be to avoid orphaning open streams, etc.
          • Assertion in logEdit: can you add a text error message like "bad state: " + state so that if the assertion fails we're left with more useful info?

          FSN:

          • Do you really need to make all of the FSN methods VisibleForTesting? We have a class called NameNodeAdapter which you can probably use to "reach in" without changing visibility. Or, why not just make a non-HA DFSClient talking to the first NN in the test case?
          • editLogTailer should be defined up higher in the file, no?
          • it seems like the recoverUnclosedStreams should happen just when changing to writer mode, rather than at stopReadingEditLogs (it isn't obvious that this method would mutate the dir state). Otherwise, when we clean-shutdown the standby, it might try to move around some logs, no?

          • Why does matchEditLogs accept null now? It's called with the result of FileUtil.listFiles which never returns null
          • A couple spurious changes in FileJournalManager, JournalManager

          MiniDFSCluster:

          • the builder method should be haEnabled not setHaEnabled to match the pattern of the other methods

          • Need a license on TestEditLogTailer
          Aaron T. Myers added a comment -

          Also, I should have mentioned, this test doesn't exercise the DN message portion at all. I intend to write some tests for that and update the patch accordingly next.

          Aaron T. Myers added a comment -

          Here's an updated version of the previous patch provided by Jitendra. It does the following things:

          1. Rebased on current HDFS-1623 branch.
          2. Adds infrastructure to MiniDFSCluster to be able to start HA NNs.
          3. Adds a test based roughly on Todd's earlier test.
          4. Fixes up a few corner cases.
          5. Should address all of Todd's feedback.

          Todd, could you please take a look at this? The one piece of feedback I didn't take was changing sharedEditsDirs to sharedEditsUris, since doing so would be inconsistent with all the other places in the code where we refer to those collections as "dirs."

          Todd Lipcon added a comment -

          Cool, I'll look forward to your next revision. Let me know if I can help in any way.

          Another thing I was considering while reading your patch is that it would be nice if the messages went through the same code path regardless of whether the NN is in standby or active mode. That way we have fewer code paths to debug. Does that seem feasible?

          Jitendra Nath Pandey added a comment -

          Thanks for the early review, Todd.

          The patch is still in the works. To reduce the amount of memory required to store the pending messages, I am considering the following two approaches (a sketch of the first follows this list).

          1) Instead of storing the entire block report, store only those blocks that have a newer GS. This will reduce the memory required to store pending messages.

          2) Allow reading segments from the middle, but only in the following two cases:
            a) The segment is finalized.
            b) The segment is in progress and a threshold of time has passed, just to avoid opening the in_progress file too frequently.
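
          A self-contained sketch of approach 1 (with simplified stand-ins for BlockListAsLongs and friends):

            import java.util.ArrayList;
            import java.util.List;

            class PendingReportFilterSketch {
              static class ReportedBlock {
                final long blockId;
                final long genStamp;
                ReportedBlock(long id, long gs) { blockId = id; genStamp = gs; }
              }

              // Only blocks whose gen stamp is ahead of the standby need to wait
              // for the edit log; the rest of the report can be processed now.
              static List<ReportedBlock> blocksToDefer(List<ReportedBlock> report,
                                                       long standbyGenStamp) {
                List<ReportedBlock> deferred = new ArrayList<>();
                for (ReportedBlock b : report) {
                  if (b.genStamp > standbyGenStamp) {
                    deferred.add(b);
                  }
                }
                return deferred;
              }
            }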

          Todd Lipcon added a comment -
          • can we move EditLogTailer to the ha package?
          • should probably have some brief javadoc there. Also, if it is a public class, it needs InterfaceAudience/Stability annotations
          • why do we sleep 60 seconds between tails? I think keeping the standby as up to date as possible is important. Though it's not an immediate goal of today's project, we should keep in mind the secondary goal of serving stale reads from the standby.
          • using the terminology "sharedEditDirs" implies that we only support directories here. Instead, shouldn't we call it "sharedEditUris"? Same with the configs, etc.
          • the code in stopReadingEditLogs seems really race-prone. We need better inter-thread coordination than just sleeping.
          • The name waitForGenStamp implies that it waits for something, but in fact this is just isGenStampInFuture
          • need license on PendingDataNodeMessages
          • need javadoc in lots of spots - explain the purposes of the new class, etc
          • getMaxGsInBlockList could be moved to a static method in BlockListAsLongs
          • DataNodeMessage and subclasses: make the fields final
          • needs unit tests - there are some in my earlier patch for this issue that could be used to verify EditLogTailer, I think.

          I want to also do some thinking on synchronization for the datanode messages, etc. Will write more later.

          Ivan Kelly added a comment -

          > initJournals checks the state.

          Ah, so it does. Ignore that comment then.

          Jitendra Nath Pandey added a comment -

          > I'm not sure the tailer will work as is. What happens if you open an in-progress input stream with this? As I understand it, you'll end up with lastTxnId in the middle of the segment.

          It will be fixed if FileJournalManager doesn't return in-progress segments. The standby will lag a bit, but upon a failover it will catch up. We need to evaluate the impact on the number of stored datanode messages.

          > In FSEditLog, check the states before transitioning them.

          initJournals checks the state. Do you mean I should rather check the state in initJournalsForRead/Write?

          Jitendra Nath Pandey added a comment -

          The attached patch doesn't block the call; rather, it stores the unprocessed messages.

          Ivan Kelly added a comment -

          General direction looks good. I've a few comments.

          There's a comma in the configuration key.

          In FSEditLog, check the states before transitioning them.

          I'm not sure the tailer will work as is. What happens if you open an in-progress input stream with this? As I understand it, you'll end up with lastTxnId in the middle of the segment.

          In stopReadingEditLogs(), instead of doing the start/stop, you could ensure up-to-dateness with a call to EditLogTailer#applyLatestUpdates(). Then EditLogTailerThread could also call this in its loop.

          Jitendra Nath Pandey added a comment -

          This is a very early version of the patch.
          In this patch I am letting the datanode call block if the generation stamp is not up to date. I will upload a patch very soon which won't block the call and will instead queue the message. The patch is not review-ready yet, but it indicates the approach being taken.

          Jitendra Nath Pandey added a comment -

          > I think we need to add nextGenerationStamp calls in a few places.
          I agree.

          > Have you enumerated the various coordinations that we need to consider?
          IMO, we need to consider synchronization with edit logs for any message that the Datanode sends to the standby, i.e. for every method in DatanodeProtocol. I think we need synchronization only in those methods that refer to blocks. Here is the list of all the methods and my classification of whether synchronization is needed.

          1. registerDatanode :
            • I think no synchronization is needed, because there is no corresponding datanode info coming from edit logs.
          2. reportBadBlocks:
            • Synchronization is not needed, because the blocks being reported bad must have been reported earlier in a block report or a block received message by the datanode. Therefore, if the block is not found in the block map of the standby, it only means it's a deleted block.
          3. commitBlockSynchronization:
            • Synchronization is needed for the same reason as in the block received case.
          4. blockReport:
            • Synchronization is needed because standby may not even have seen a block that is reported in block report.
          5. blockReceived:
            • Synchronization is needed because standby may not even have seen a block that is reported in block received.
          6. sendHeartbeat :
            • No synchronization is needed with edit logs.
          7. errorReport:
            • Standby can just ignore this?
          8. versionRequest:
            • Standby can just ignore this?
          9. processUpgradeCommand:
            • Ignored by Standby.

          From the list above, it seems to me that coordination is only required for block related info received from datanode vs that received in edit logs. Therefore using generation stamp is a good choice because all blocks have a generation stamp. Is that a valid conclusion?

          Considering the txid approach, it seems it won't work. Consider the following case:
          The standby receives a block received message and doesn't find the block in its map. That is possible for two reasons:
          a) the standby hasn't seen the edit log entry for the block allocation.
          b) the standby has seen and processed the block allocation and also a delete for that block.
          The standby needs to be able to distinguish between these two possibilities to correctly process the block received.
          Now it may be possible that the allocation and/or delete happened after the last command from the namenode, so the last transaction id known to the datanode is older than the allocation/delete. Then the standby won't know how to process the received block.

          Suresh Srinivas added a comment -

          Jitendra and I considered transaction IDs before looking at GS. Transaction IDs do not work because there are three parties involved - client, datanode and namenode.

          Take this example:

          1. DN1 sends heartbeat to primary NN at txid T and learns about T.
          2. A client meanwhile creates a file at T+1 and allocates block at T+2.
          3. DN1 now is unable to send heartbeat or communicate with primary NN. Hence it is stuck at transaction T.
          4. The client completes writing a block to DN1. DN1 reports this to the backup node as a block received with T. At this point, if the SBNN has reached T but has not processed T+1 or T+2, it tries to handle BR(T), because it can. However, it fails to process it without knowledge of the file.

          We could get around this if the client also tracked transactions and sent them to the datanode, but that adds unnecessary complexity and changes.

          Todd Lipcon added a comment -

          Hi Jitendra. If you plan to use the generation stamp to synchronize the DN's block info with the edit stream, I think we need to add nextGenerationStamp calls in a few places. In particular, in allocateBlock, we don't bump the generation stamp, so the second block of a new file will have the same GS as the first block if no other actions happen.

          Have you enumerated the various coordinations that we need to consider? The above deals with allocateBlock (namespace op) vs blockReceived (dn op), but I wonder if there are other places we need to synchronize the DN action after some NN action and aren't bumping the GS.

          One thing I was considering was threading transaction IDs through the various operations - for example one possibility is for the active NN to send the current txid to the DN in every DatanodeCommand. Then, any reports from DN->NN include that txid, and the standby can block until it's hit that txid. Using txid instead of generation stamp means we don't have to consider each type of operation as a special case. Thoughts?

          Jitendra Nath Pandey added a comment -

          There are two aspects to this problem.

          1. How does the standby get and read the edit logs?
          2. How does it synchronize the edit logs with the updates coming from the datanodes?

          I am taking the following approach.

          Access to logs:
          ---------------

          1. A new configuration parameter will be added that tells the standby which edit log locations it can read. This configuration will hold a subset of the locations that are configured as edit log locations on the primary. (A hypothetical sketch of wiring this up follows the list.)
          2. The standby will use the JournalSet implementation of the JournalManager interface to read the transactions. (Refer HDFS-1580, HDFS-2158.)
          3. If no transactions are available, EditLogInputStream throws NoMoreTransactionsException; the standby just sleeps for a short timeout and retries.
          4. On failover, the standby reads the edit logs for the last time after the primary has been fenced, in the following steps:
            • close the edit log input stream.
            • open the edit log input stream.
            • read
            • close
              The close in the first step could be useful since the edit log is stored on NFS, to make sure that the latest changes are visible.
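
          For illustration, wiring up such a parameter might look like the following (the key name and value are assumptions here, not quoted from the patch):

            import org.apache.hadoop.conf.Configuration;

            class SharedEditsConfSketch {
              static Configuration configureStandby() {
                Configuration conf = new Configuration();
                // A subset of the active NN's edits dirs, hosted on shared storage
                // (the key name is hypothetical):
                conf.set("dfs.namenode.shared.edits.dir",
                         "file:///mnt/filer/hdfs/shared-edits");
                return conf;
              }
            }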

          Updates from datanodes
          ----------------------
          The standby maintains the latest generation stamp, based on the records from the edit log. This is used for processing updates from datanodes as follows (a sketch of the buffering rule follows this list):

          1. The standby receives a block report/block received/block deleted (pending HDFS-395) message from the datanode.
            • If the block's GS is less than or equal to the current GS of the standby, the standby processes it.
            • If the block's GS is greater than the current GS of the standby, the standby buffers the block received/deleted message and waits until it sees the corresponding generation stamp in the edit log, and only then processes it.
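
          A minimal sketch of that buffering rule (a simplified stand-in for the pending-message handling, not the actual patch):

            import java.util.PriorityQueue;

            class PendingMessagesSketch {
              static class DnMessage implements Comparable<DnMessage> {
                final long genStamp;
                final Runnable apply;   // e.g. process a blockReceived
                DnMessage(long gs, Runnable apply) { this.genStamp = gs; this.apply = apply; }
                public int compareTo(DnMessage o) { return Long.compare(genStamp, o.genStamp); }
              }

              private long standbyGenStamp;
              private final PriorityQueue<DnMessage> pending = new PriorityQueue<>();

              void onDnMessage(DnMessage m) {
                if (m.genStamp <= standbyGenStamp) {
                  m.apply.run();        // the standby has caught up: process now
                } else {
                  pending.add(m);       // wait for the edit log to catch up
                }
              }

              // Called when the edit log tailer advances the generation stamp.
              void onGenStampAdvanced(long newGenStamp) {
                standbyGenStamp = newGenStamp;
                while (!pending.isEmpty() && pending.peek().genStamp <= standbyGenStamp) {
                  pending.poll().apply.run();
                }
              }
            }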
          Aaron T. Myers added a comment -

          Looks like a pretty good first draft, Todd. Since this isn't intended for commit, I didn't review it for style. One question:

          Given that the lagLength is set to 1000, will there be a problem if an FSEditLogOp is serialized to more than 1000 bytes? Or if an EOF is encountered in the middle of a serialized op? I suspect some exception might get thrown in this case, which would cause the tailer to fail.

          Todd Lipcon added a comment -

          Oops, I had missed a git add in the previous upload

          Todd Lipcon added a comment -

          Here's a sketch of some basic code to do this. The approach is similar to what's done in the AvatarNode – a thread follows the most recent edit file with a predefined "lag" behind the file length.

          This is by no means done, but wanted to get some code started.
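
          In outline, the tailing loop amounts to something like this (a simplified stand-in, not the attached patch; note the caveat above about ops straddling the lag boundary):

            import java.io.File;
            import java.io.IOException;
            import java.io.RandomAccessFile;

            class EditLogTailerSketch extends Thread {
              private final File editsFile;
              private final long lagLength;   // e.g. 1000 bytes
              private long readOffset = 0;
              private volatile boolean running = true;

              EditLogTailerSketch(File editsFile, long lagLength) {
                this.editsFile = editsFile;
                this.lagLength = lagLength;
              }

              public void run() {
                while (running) {
                  try {
                    // Only read up to (file length - lag) so we stay clear of a
                    // partially written tail.
                    long safeEnd = editsFile.length() - lagLength;
                    if (safeEnd > readOffset) {
                      try (RandomAccessFile raf = new RandomAccessFile(editsFile, "r")) {
                        raf.seek(readOffset);
                        byte[] buf = new byte[(int) (safeEnd - readOffset)];
                        raf.readFully(buf);
                        readOffset = safeEnd;
                        // ... decode FSEditLogOps from buf and apply them (omitted) ...
                      }
                    }
                    Thread.sleep(100);   // poll interval
                  } catch (IOException | InterruptedException e) {
                    running = false;
                  }
                }
              }
            }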


People

    • Assignee: Jitendra Nath Pandey
    • Reporter: Suresh Srinivas
    • Votes: 0
    • Watchers: 16
