Details

    • Type: Sub-task Sub-task
    • Status: Resolved
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: HA branch (HDFS-1623)
    • Component/s: ha
    • Labels:
      None

      Description

      There are a number of "TODO(HA)" and "TODO:HA" comments we need to fix or remove.

      1. HDFS-2920-HDFS-1623.patch
        35 kB
        Aaron T. Myers
      2. HDFS-2920-HDFS-1623.patch
        44 kB
        Aaron T. Myers
      3. hdfs-2920-v1.txt
        32 kB
        Todd Lipcon

        Issue Links

          Activity

          Hide
          Hudson added a comment -

          Integrated in Hadoop-Hdfs-HAbranch-build #92 (See https://builds.apache.org/job/Hadoop-Hdfs-HAbranch-build/92/)
          HDFS-2920. fix remaining TODO items. Contributed by Aaron T. Myers and Todd Lipcon. (Revision 1294923)

          Result = UNSTABLE
          atm : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1294923
          Files :

          • /hadoop/common/branches/HDFS-1623/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/ipc/RPC.java
          • /hadoop/common/branches/HDFS-1623/hadoop-hdfs-project/hadoop-hdfs/CHANGES.HDFS-1623.txt
          • /hadoop/common/branches/HDFS-1623/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/DFSClient.java
          • /hadoop/common/branches/HDFS-1623/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/protocol/ClientProtocol.java
          • /hadoop/common/branches/HDFS-1623/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/protocolPB/ClientDatanodeProtocolTranslatorPB.java
          • /hadoop/common/branches/HDFS-1623/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/BlockManager.java
          • /hadoop/common/branches/HDFS-1623/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/datanode/BPOfferService.java
          • /hadoop/common/branches/HDFS-1623/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/datanode/BPServiceActor.java
          • /hadoop/common/branches/HDFS-1623/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/datanode/BlockPoolManager.java
          • /hadoop/common/branches/HDFS-1623/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/datanode/DataNode.java
          • /hadoop/common/branches/HDFS-1623/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/datanode/UpgradeManagerDatanode.java
          • /hadoop/common/branches/HDFS-1623/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/datanode/UpgradeObjectDatanode.java
          • /hadoop/common/branches/HDFS-1623/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/FSDirectory.java
          • /hadoop/common/branches/HDFS-1623/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/FSEditLogLoader.java
          • /hadoop/common/branches/HDFS-1623/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/FSImage.java
          • /hadoop/common/branches/HDFS-1623/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/FSNamesystem.java
          • /hadoop/common/branches/HDFS-1623/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/NameNode.java
          • /hadoop/common/branches/HDFS-1623/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/ha/StandbyCheckpointer.java
          • /hadoop/common/branches/HDFS-1623/hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/MiniDFSCluster.java
          • /hadoop/common/branches/HDFS-1623/hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/datanode/TestBlockRecovery.java
          • /hadoop/common/branches/HDFS-1623/hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/datanode/TestDataNodeMultipleRegistrations.java
          • /hadoop/common/branches/HDFS-1623/hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/datanode/TestRefreshNamenodes.java
          • /hadoop/common/branches/HDFS-1623/hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/namenode/ha/TestPipelinesFailover.java
          • /hadoop/common/branches/HDFS-1623/hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/namenode/ha/TestStandbyCheckpoints.java
          Show
          Hudson added a comment - Integrated in Hadoop-Hdfs-HAbranch-build #92 (See https://builds.apache.org/job/Hadoop-Hdfs-HAbranch-build/92/ ) HDFS-2920 . fix remaining TODO items. Contributed by Aaron T. Myers and Todd Lipcon. (Revision 1294923) Result = UNSTABLE atm : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1294923 Files : /hadoop/common/branches/ HDFS-1623 /hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/ipc/RPC.java /hadoop/common/branches/ HDFS-1623 /hadoop-hdfs-project/hadoop-hdfs/CHANGES. HDFS-1623 .txt /hadoop/common/branches/ HDFS-1623 /hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/DFSClient.java /hadoop/common/branches/ HDFS-1623 /hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/protocol/ClientProtocol.java /hadoop/common/branches/ HDFS-1623 /hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/protocolPB/ClientDatanodeProtocolTranslatorPB.java /hadoop/common/branches/ HDFS-1623 /hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/BlockManager.java /hadoop/common/branches/ HDFS-1623 /hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/datanode/BPOfferService.java /hadoop/common/branches/ HDFS-1623 /hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/datanode/BPServiceActor.java /hadoop/common/branches/ HDFS-1623 /hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/datanode/BlockPoolManager.java /hadoop/common/branches/ HDFS-1623 /hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/datanode/DataNode.java /hadoop/common/branches/ HDFS-1623 /hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/datanode/UpgradeManagerDatanode.java /hadoop/common/branches/ HDFS-1623 /hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/datanode/UpgradeObjectDatanode.java /hadoop/common/branches/ HDFS-1623 /hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/FSDirectory.java /hadoop/common/branches/ HDFS-1623 /hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/FSEditLogLoader.java /hadoop/common/branches/ HDFS-1623 /hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/FSImage.java /hadoop/common/branches/ HDFS-1623 /hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/FSNamesystem.java /hadoop/common/branches/ HDFS-1623 /hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/NameNode.java /hadoop/common/branches/ HDFS-1623 /hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/ha/StandbyCheckpointer.java /hadoop/common/branches/ HDFS-1623 /hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/MiniDFSCluster.java /hadoop/common/branches/ HDFS-1623 /hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/datanode/TestBlockRecovery.java /hadoop/common/branches/ HDFS-1623 /hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/datanode/TestDataNodeMultipleRegistrations.java /hadoop/common/branches/ HDFS-1623 /hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/datanode/TestRefreshNamenodes.java /hadoop/common/branches/ HDFS-1623 /hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/namenode/ha/TestPipelinesFailover.java /hadoop/common/branches/ HDFS-1623 /hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/namenode/ha/TestStandbyCheckpoints.java
          Aaron T. Myers made changes -
          Status Open [ 1 ] Resolved [ 5 ]
          Hadoop Flags Reviewed [ 10343 ]
          Fix Version/s HA branch (HDFS-1623) [ 12317568 ]
          Resolution Fixed [ 1 ]
          Hide
          Aaron T. Myers added a comment -

          Thanks a lot for the reviews, Eli. I've just committed this to the HA branch.

          Show
          Aaron T. Myers added a comment - Thanks a lot for the reviews, Eli. I've just committed this to the HA branch.
          Hide
          Eli Collins added a comment -

          +1

          Show
          Eli Collins added a comment - +1
          Aaron T. Myers made changes -
          Attachment HDFS-2920-HDFS-1623.patch [ 12516392 ]
          Hide
          Aaron T. Myers added a comment -

          Thanks a lot for the review, Eli. Here's an updated patch which takes out the delayed shutdown stuff and the health check. I've filed HDFS-3026 and HDFS-3027 to address those issues.

          Show
          Aaron T. Myers added a comment - Thanks a lot for the review, Eli. Here's an updated patch which takes out the delayed shutdown stuff and the health check. I've filed HDFS-3026 and HDFS-3027 to address those issues.
          Hide
          Eli Collins added a comment -

          Hey ATM, how about putting the NN health check implementation and delayed shutdown to separate jiras? They seem like discrete, self-contained changes (and things we can handle post merge if we want, though I think we can iterate on them quickly today). Only have a couple comments on these parts, everything else is good to go.

          • !nameNodeHasResourcesAvailable implies "The NameNode has run out of resources" instead of the
            "NameNode is low on resources". It would be even better if the message was more specific (eg mentioned lack of inodes of disk space).
          • Wrt delayed shutdown, we likely have (or should have) similar code elsewhere right since there's nothing HA specific? Why is the shutdown delayed rather than immediate?
          • Wrt "Error encountered during state transition.", isn't the error most likely due to a failure to start a service?
          Show
          Eli Collins added a comment - Hey ATM, how about putting the NN health check implementation and delayed shutdown to separate jiras? They seem like discrete, self-contained changes (and things we can handle post merge if we want, though I think we can iterate on them quickly today). Only have a couple comments on these parts, everything else is good to go. !nameNodeHasResourcesAvailable implies "The NameNode has run out of resources" instead of the "NameNode is low on resources". It would be even better if the message was more specific (eg mentioned lack of inodes of disk space). Wrt delayed shutdown, we likely have (or should have) similar code elsewhere right since there's nothing HA specific? Why is the shutdown delayed rather than immediate? Wrt "Error encountered during state transition.", isn't the error most likely due to a failure to start a service?
          Aaron T. Myers made changes -
          Attachment HDFS-2920-HDFS-1623.patch [ 12516273 ]
          Hide
          Aaron T. Myers added a comment -

          Here's an updated patch on top of the one Todd posted, which should address most remaining TODOs.

          The differences between this patch and the last are:

          • Remove TODOs from ClientProtocol regarding idempotency. setBalancerBandwidth is indeed idemptoent. The other two are upgrade-related, which is out of scope for HA.
          • Remove two now-useless TODOs from FSImage, which were addressed by other JIRAs.
          • Remove TODO from NameNode about implementing checkHealth. Implement a simple health check which just checks for available NN resources.
          • Remove a TODO from NameNode about the need to handle partially-failed state transitions. Implement delayed shutdown in case of failure to complete state transition.

          I left in two TODO(HA)s:

          1. TODOs regarding testing the BK journal manager. I filed HDFS-3022 to track this.
          2. A seemingly low-priority TODO about a possible race condition when canceling checkpoints. This race condition is relatively innocuous, as described in this comment, and is tracked by this JIRA: HDFS-2800
          Show
          Aaron T. Myers added a comment - Here's an updated patch on top of the one Todd posted, which should address most remaining TODOs. The differences between this patch and the last are: Remove TODOs from ClientProtocol regarding idempotency. setBalancerBandwidth is indeed idemptoent. The other two are upgrade-related, which is out of scope for HA. Remove two now-useless TODOs from FSImage, which were addressed by other JIRAs. Remove TODO from NameNode about implementing checkHealth. Implement a simple health check which just checks for available NN resources. Remove a TODO from NameNode about the need to handle partially-failed state transitions. Implement delayed shutdown in case of failure to complete state transition. I left in two TODO(HA)s: TODOs regarding testing the BK journal manager. I filed HDFS-3022 to track this. A seemingly low-priority TODO about a possible race condition when canceling checkpoints. This race condition is relatively innocuous, as described in this comment , and is tracked by this JIRA: HDFS-2800
          Aaron T. Myers made changes -
          Assignee Todd Lipcon [ tlipcon ] Aaron T. Myers [ atm ]
          Hide
          Eli Collins added a comment -

          +1 looks good

          Show
          Eli Collins added a comment - +1 looks good
          Todd Lipcon made changes -
          Attachment hdfs-2920-v1.txt [ 12515665 ]
          Hide
          Todd Lipcon added a comment -

          Attached is a preliminary patch that addresses a bunch of TODOs. This doesn't address all of them, but since this is sort of a "grab bag" issue I wanted to put it up for early review now, and will add some more deltas on top in the next couple days.

          Summary of changes:


          .../src/main/java/org/apache/hadoop/ipc/RPC.java | 6 ++
          .../java/org/apache/hadoop/hdfs/DFSClient.java | 15 +----
          .../ClientDatanodeProtocolTranslatorPB.java | 9 ++-
          Fixes a bunch of warnings I saw in the logs where RPC.stopProxy() was being called on translator objects. This would partially apply to trunk, but some of the HA changes we've done made it worse here, and would conflict anyway, so addressed in this patch.


          .../hdfs/server/blockmanagement/BlockManager.java | 1 -
          Remove unused import


          .../hdfs/server/datanode/BPOfferService.java | 14 +---

          • Remove a TODO about a potential synchronization problem - we we haven't seen it in practice, and it's only in shutdown, so I'm not too concerned.
          • remove getNNSocketAddress since a BPOS now has multiple NNs associated.
          • undeprecate getActiveNN, remove the TODO, since it's actually used by block synchronization (one of the few cases where we want to RPC directly to whichever NN is active and not the other). This is also used by the (now-abandoned) "distributed upgrade" code.

          .../hdfs/server/datanode/BPServiceActor.java | 4 +-

          Remove TODO about the way that addBlockPool is called. It strikes me as weird that we initialize the scanner after every heartbeat, but it's not related to HA.


          .../hdfs/server/datanode/BlockPoolManager.java | 10 —

          Remove getter which takes an NN address, since a block pool can correspond to multiple NN addresses.


          .../hadoop/hdfs/server/datanode/DataNode.java | 71 ++++++++++++--------

          • refactor cases where errorReport was directly called on an NN proxy to be passed through the DataNode->BPOS path more centrally. errorReports should go to both nodes.
          • remove getNameNodeAddr(bpid) since again, a BP may have multiple NN addresses
          • rename getBPNamenode to getActiveNamenodeForBP to clarify the result in the case of HA
          • fix getNamenodeAddresses (part of JMX bean) to properly include all HA NNs
          • rename isBPServiceAlive to isConnectedToNN to clarify its purpose - this is only used by MiniDFS currently, in order to determine when each DN has connected to the NNs in the cluster.

          .../server/datanode/UpgradeManagerDatanode.java | 2 +-
          .../server/datanode/UpgradeObjectDatanode.java | 11 +---

          (updates based on above method renames)


          .../hadoop/hdfs/server/namenode/FSDirectory.java | 15 ++---

          • remove a TODO about a useless lock - was taking a writeLock despite asserting just above that the writeLock was already held

          .../hdfs/server/namenode/FSEditLogLoader.java | 4 +-

          • remove a TODO and replace it with an explanation about why the code is counter-intuitive there

          .../hadoop/hdfs/server/namenode/FSImage.java | 4 -

          • remove TODO about never saving a checkpoint on startup in HA. We addressed this issue with HDFS-2794.
          • remove TODO about quota tracking being slow - filed HDFS-2989

          .../hadoop/hdfs/server/namenode/FSNamesystem.java | 2 -
          .../hadoop/hdfs/server/namenode/NameNode.java | 1 -
          Removed three TODOs which were really more like idle design thoughts, don't seme to be relevant.


          .../org/apache/hadoop/hdfs/MiniDFSCluster.java | 6 +-
          .../hdfs/server/datanode/TestBlockRecovery.java | 6 +-
          .../TestDataNodeMultipleRegistrations.java | 20 ++++--
          .../hdfs/server/datanode/TestRefreshNamenodes.java | 28 +++++---
          Just tracking renames above


          .../server/namenode/ha/TestPipelinesFailover.java | 4 +-
          .../server/namenode/ha/TestStandbyCheckpoints.java | 3 -
          Stray TODOs which were already addressed previously

          Show
          Todd Lipcon added a comment - Attached is a preliminary patch that addresses a bunch of TODOs. This doesn't address all of them, but since this is sort of a "grab bag" issue I wanted to put it up for early review now, and will add some more deltas on top in the next couple days. Summary of changes: .../src/main/java/org/apache/hadoop/ipc/RPC.java | 6 ++ .../java/org/apache/hadoop/hdfs/DFSClient.java | 15 +---- .../ClientDatanodeProtocolTranslatorPB.java | 9 ++- Fixes a bunch of warnings I saw in the logs where RPC.stopProxy() was being called on translator objects. This would partially apply to trunk, but some of the HA changes we've done made it worse here, and would conflict anyway, so addressed in this patch. .../hdfs/server/blockmanagement/BlockManager.java | 1 - Remove unused import .../hdfs/server/datanode/BPOfferService.java | 14 +--- Remove a TODO about a potential synchronization problem - we we haven't seen it in practice, and it's only in shutdown, so I'm not too concerned. remove getNNSocketAddress since a BPOS now has multiple NNs associated. undeprecate getActiveNN , remove the TODO, since it's actually used by block synchronization (one of the few cases where we want to RPC directly to whichever NN is active and not the other). This is also used by the (now-abandoned) "distributed upgrade" code. .../hdfs/server/datanode/BPServiceActor.java | 4 +- Remove TODO about the way that addBlockPool is called. It strikes me as weird that we initialize the scanner after every heartbeat, but it's not related to HA. .../hdfs/server/datanode/BlockPoolManager.java | 10 — Remove getter which takes an NN address, since a block pool can correspond to multiple NN addresses. .../hadoop/hdfs/server/datanode/DataNode.java | 71 ++++++++++++-------- refactor cases where errorReport was directly called on an NN proxy to be passed through the DataNode->BPOS path more centrally. errorReports should go to both nodes. remove getNameNodeAddr(bpid) since again, a BP may have multiple NN addresses rename getBPNamenode to getActiveNamenodeForBP to clarify the result in the case of HA fix getNamenodeAddresses (part of JMX bean) to properly include all HA NNs rename isBPServiceAlive to isConnectedToNN to clarify its purpose - this is only used by MiniDFS currently, in order to determine when each DN has connected to the NNs in the cluster. .../server/datanode/UpgradeManagerDatanode.java | 2 +- .../server/datanode/UpgradeObjectDatanode.java | 11 +--- (updates based on above method renames) .../hadoop/hdfs/server/namenode/FSDirectory.java | 15 ++--- remove a TODO about a useless lock - was taking a writeLock despite asserting just above that the writeLock was already held .../hdfs/server/namenode/FSEditLogLoader.java | 4 +- remove a TODO and replace it with an explanation about why the code is counter-intuitive there .../hadoop/hdfs/server/namenode/FSImage.java | 4 - remove TODO about never saving a checkpoint on startup in HA. We addressed this issue with HDFS-2794 . remove TODO about quota tracking being slow - filed HDFS-2989 .../hadoop/hdfs/server/namenode/FSNamesystem.java | 2 - .../hadoop/hdfs/server/namenode/NameNode.java | 1 - Removed three TODOs which were really more like idle design thoughts, don't seme to be relevant. .../org/apache/hadoop/hdfs/MiniDFSCluster.java | 6 +- .../hdfs/server/datanode/TestBlockRecovery.java | 6 +- .../TestDataNodeMultipleRegistrations.java | 20 ++++-- .../hdfs/server/datanode/TestRefreshNamenodes.java | 28 +++++--- Just tracking renames above .../server/namenode/ha/TestPipelinesFailover.java | 4 +- .../server/namenode/ha/TestStandbyCheckpoints.java | 3 - Stray TODOs which were already addressed previously
          Aaron T. Myers made changes -
          Link This issue is related to HDFS-2928 [ HDFS-2928 ]
          Hide
          Todd Lipcon added a comment -

          I'm going to work on this, and also address a few other misc cleanup items under the same heading (eg removing some private/internal methods that were marked deprecated and need to finish getting ripped out)

          Show
          Todd Lipcon added a comment - I'm going to work on this, and also address a few other misc cleanup items under the same heading (eg removing some private/internal methods that were marked deprecated and need to finish getting ripped out)
          Todd Lipcon made changes -
          Field Original Value New Value
          Assignee Todd Lipcon [ tlipcon ]
          Eli Collins created issue -

            People

            • Assignee:
              Aaron T. Myers
              Reporter:
              Eli Collins
            • Votes:
              0 Vote for this issue
              Watchers:
              4 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development