
HDFS-10206: Datanodes not sorted properly by distance when the reader isn't a datanode

    Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 2.9.0, 3.0.0-alpha2
    • Component/s: None
    • Labels:
      None
    • Hadoop Flags:
      Reviewed

      Description

      If the DFSClient machine is not a datanode, but it shares its rack with some datanodes of the requested HDFS block, DatanodeManager#sortLocatedBlocks might not put the local-rack datanodes at the beginning of the sorted list. That is because the function doesn't call networktopology.add(client) to set the client node's parent, which networktopology.sortByDistance needs in order to compute the distance between two nodes in the same topology tree.

      Another issue with networktopology.sortByDistance is that it only distinguishes the local rack from remote racks; it doesn't support a general distance calculation that tells how remote a rack is.

      NetworkTopology.java
        protected int getWeight(Node reader, Node node) {
          // 0 is local, 1 is same rack, 2 is off rack
          // Start off by initializing to off rack
          int weight = 2;
          if (reader != null) {
            if (reader.equals(node)) {
              weight = 0;
            } else if (isOnSameRack(reader, node)) {
              weight = 1;
            }
          }
          return weight;
        }
      

      HDFS-10203 has suggested moving the sorting from the namenode to the DFSClient to address another issue. Regardless of where we do the sorting, we still need to fix the issues outlined here.

      Note that BlockPlacementPolicyDefault shares the same NetworkTopology object used by DatanodeManager and requires Nodes stored in the topology to be DatanodeDescriptor for block placement. So we need to make sure we don't pollute the NetworkTopology if we plan to fix it on the server side.
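
      As a rough illustration of the direction the discussion below takes (a minimal sketch, not the committed patch): instead of requiring the reader to be present in the topology tree, a weight can be derived by comparing the reader's network location with each datanode's location token by token, and the replicas can then be sorted by that weight. The name getWeightUsingNetworkLocation follows the later comments; the class, the helper method, and the example locations are made up here.

      SortByLocationSketch.java
        import java.util.Arrays;
        import java.util.Comparator;

        final class SortByLocationSketch {
          /**
           * Weight grows by one for every level separating each location from their
           * deepest common ancestor; identical locations yield 0.
           */
          static int getWeightUsingNetworkLocation(String readerLocation,
                                                   String nodeLocation) {
            String[] reader = readerLocation.split("/");
            String[] node = nodeLocation.split("/");
            int common = 0;
            while (common < reader.length && common < node.length
                && reader[common].equals(node[common])) {
              common++;
            }
            return (reader.length - common) + (node.length - common);
          }

          /** Sort datanode locations so the ones closest to the reader come first. */
          static void sortForReader(String readerLocation, String[] locations) {
            Arrays.sort(locations, Comparator.comparingInt(
                loc -> getWeightUsingNetworkLocation(readerLocation, loc)));
          }

          public static void main(String[] args) {
            String[] locations = {"/dc1/rack2", "/dc1/rack1", "/dc2/rack1"};
            // A non-datanode reader on /dc1/rack1 still gets its local rack first.
            sortForReader("/dc1/rack1", locations);
            System.out.println(Arrays.toString(locations));
            // Prints: [/dc1/rack1, /dc1/rack2, /dc2/rack1]
          }
        }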

      Attachments

      1. HDFS-10206.000.patch
        6 kB
        Nanda kumar
      2. HDFS-10206.001.patch
        11 kB
        Nanda kumar
      3. HDFS-10206.002.patch
        11 kB
        Nanda kumar
      4. HDFS-10206.003.patch
        11 kB
        Nanda kumar
      5. HDFS-10206-branch-2.8.003.patch
        11 kB
        Nanda kumar

        Activity

        nandakumar131 Nanda kumar added a comment -

        Thanks Ming Ma for the reviews and commit.

        hudson Hudson added a comment -

        SUCCESS: Integrated in Jenkins build Hadoop-trunk-Commit #10959 (See https://builds.apache.org/job/Hadoop-trunk-Commit/10959/)
        HDFS-10206. Datanodes not sorted properly by distance when the reader (mingma: rev c73e08a6dad46cad14b38a4a586a5cda1622b206)

        • (edit) hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/net/TestNetworkTopology.java
        • (edit) hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/DatanodeManager.java
        • (edit) hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/net/NetworkTopology.java
        mingma Ming Ma added a comment -

        +1. Thanks Nanda kumar for the contribution. I have committed the patch to trunk and branch-2.

        hadoopqa Hadoop QA added a comment -
        -1 overall



        Vote Subsystem Runtime Comment
        0 reexec 0m 25s Docker mode activated.
        +1 @author 0m 0s The patch does not contain any @author tags.
        +1 test4tests 0m 0s The patch appears to include 1 new or modified test files.
        0 mvndep 3m 17s Maven dependency ordering for branch
        +1 mvninstall 7m 5s branch-2.8 passed
        +1 compile 6m 56s branch-2.8 passed with JDK v1.8.0_111
        +1 compile 7m 23s branch-2.8 passed with JDK v1.7.0_121
        +1 checkstyle 1m 7s branch-2.8 passed
        +1 mvnsite 1m 53s branch-2.8 passed
        +1 mvneclipse 0m 31s branch-2.8 passed
        +1 findbugs 3m 35s branch-2.8 passed
        +1 javadoc 1m 48s branch-2.8 passed with JDK v1.8.0_111
        +1 javadoc 2m 40s branch-2.8 passed with JDK v1.7.0_121
        0 mvndep 0m 15s Maven dependency ordering for patch
        +1 mvninstall 1m 35s the patch passed
        +1 compile 6m 17s the patch passed with JDK v1.8.0_111
        +1 javac 6m 17s the patch passed
        +1 compile 7m 2s the patch passed with JDK v1.7.0_121
        +1 javac 7m 2s the patch passed
        -0 checkstyle 1m 12s root: The patch generated 1 new + 132 unchanged - 2 fixed = 133 total (was 134)
        +1 mvnsite 1m 57s the patch passed
        +1 mvneclipse 0m 35s the patch passed
        +1 whitespace 0m 0s The patch has no whitespace issues.
        +1 findbugs 4m 5s the patch passed
        +1 javadoc 1m 41s the patch passed with JDK v1.8.0_111
        +1 javadoc 2m 38s the patch passed with JDK v1.7.0_121
        +1 unit 8m 58s hadoop-common in the patch passed with JDK v1.7.0_121.
        -1 unit 72m 44s hadoop-hdfs in the patch failed with JDK v1.7.0_121.
        +1 asflicense 0m 36s The patch does not generate ASF License warnings.
        222m 21s



        Reason Tests
        JDK v1.7.0_121 Failed junit tests hadoop.hdfs.server.datanode.TestDataNodeVolumeFailure
          hadoop.hdfs.TestFileCorruption



        Subsystem Report/Notes
        Docker Image:yetus/hadoop:5af2af1
        JIRA Issue HDFS-10206
        JIRA Patch URL https://issues.apache.org/jira/secure/attachment/12842113/HDFS-10206-branch-2.8.003.patch
        Optional Tests asflicense compile javac javadoc mvninstall mvnsite unit findbugs checkstyle
        uname Linux 6c120829736f 3.13.0-96-generic #143-Ubuntu SMP Mon Aug 29 20:15:20 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux
        Build tool maven
        Personality /testptch/hadoop/patchprocess/precommit/personality/provided.sh
        git revision branch-2.8 / 879caf6
        Default Java 1.7.0_121
        Multi-JDK versions /usr/lib/jvm/java-8-oracle:1.8.0_111 /usr/lib/jvm/java-7-openjdk-amd64:1.7.0_121
        findbugs v3.0.0
        checkstyle https://builds.apache.org/job/PreCommit-HDFS-Build/17785/artifact/patchprocess/diff-checkstyle-root.txt
        unit https://builds.apache.org/job/PreCommit-HDFS-Build/17785/artifact/patchprocess/patch-unit-hadoop-hdfs-project_hadoop-hdfs-jdk1.7.0_121.txt
        JDK v1.7.0_121 Test Results https://builds.apache.org/job/PreCommit-HDFS-Build/17785/testReport/
        modules C: hadoop-common-project/hadoop-common hadoop-hdfs-project/hadoop-hdfs U: .
        Console output https://builds.apache.org/job/PreCommit-HDFS-Build/17785/console
        Powered by Apache Yetus 0.4.0-SNAPSHOT http://yetus.apache.org

        This message was automatically generated.

        nandakumar131 Nanda kumar added a comment -

        Thanks for the review Ming Ma.
        Have uploaded the patch on top of branch-2.8: HDFS-10206-branch-2.8.003.patch.

        mingma Ming Ma added a comment -

        Thanks Nanda kumar. The patch looks good. Given the patch doesn't apply directly to branch-2, can you provide another patch for branch-2? You can use the naming convention for the branch-2 patch based on the "Naming your patch" section in https://wiki.apache.org/hadoop/HowToContribute so that Jenkins can run the precommit job.

        hadoopqa Hadoop QA added a comment -
        -1 overall



        Vote Subsystem Runtime Comment
        0 reexec 0m 27s Docker mode activated.
        +1 @author 0m 0s The patch does not contain any @author tags.
        +1 test4tests 0m 0s The patch appears to include 1 new or modified test files.
        0 mvndep 1m 45s Maven dependency ordering for branch
        +1 mvninstall 8m 24s trunk passed
        +1 compile 11m 0s trunk passed
        +1 checkstyle 1m 36s trunk passed
        +1 mvnsite 2m 11s trunk passed
        +1 mvneclipse 0m 43s trunk passed
        +1 findbugs 3m 22s trunk passed
        +1 javadoc 1m 41s trunk passed
        0 mvndep 0m 16s Maven dependency ordering for patch
        +1 mvninstall 1m 29s the patch passed
        +1 compile 9m 57s the patch passed
        +1 javac 9m 57s the patch passed
        -0 checkstyle 1m 40s root: The patch generated 1 new + 115 unchanged - 2 fixed = 116 total (was 117)
        +1 mvnsite 2m 5s the patch passed
        +1 mvneclipse 0m 39s the patch passed
        +1 whitespace 0m 0s The patch has no whitespace issues.
        +1 findbugs 3m 40s the patch passed
        +1 javadoc 1m 34s the patch passed
        -1 unit 9m 6s hadoop-common in the patch failed.
        -1 unit 92m 26s hadoop-hdfs in the patch failed.
        +1 asflicense 0m 32s The patch does not generate ASF License warnings.
        155m 45s



        Reason Tests
        Failed junit tests hadoop.net.TestDNS
          hadoop.ipc.TestIPC
          hadoop.hdfs.server.blockmanagement.TestBlockTokenWithDFSStriped



        Subsystem Report/Notes
        Docker Image:yetus/hadoop:a9ad5d6
        JIRA Issue HDFS-10206
        JIRA Patch URL https://issues.apache.org/jira/secure/attachment/12841666/HDFS-10206.003.patch
        Optional Tests asflicense compile javac javadoc mvninstall mvnsite unit findbugs checkstyle
        uname Linux 225a231f262d 3.13.0-96-generic #143-Ubuntu SMP Mon Aug 29 20:15:20 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux
        Build tool maven
        Personality /testptch/hadoop/patchprocess/precommit/personality/provided.sh
        git revision trunk / a7288da
        Default Java 1.8.0_111
        findbugs v3.0.0
        checkstyle https://builds.apache.org/job/PreCommit-HDFS-Build/17783/artifact/patchprocess/diff-checkstyle-root.txt
        unit https://builds.apache.org/job/PreCommit-HDFS-Build/17783/artifact/patchprocess/patch-unit-hadoop-common-project_hadoop-common.txt
        unit https://builds.apache.org/job/PreCommit-HDFS-Build/17783/artifact/patchprocess/patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt
        Test Results https://builds.apache.org/job/PreCommit-HDFS-Build/17783/testReport/
        modules C: hadoop-common-project/hadoop-common hadoop-hdfs-project/hadoop-hdfs U: .
        Console output https://builds.apache.org/job/PreCommit-HDFS-Build/17783/console
        Powered by Apache Yetus 0.4.0-SNAPSHOT http://yetus.apache.org

        This message was automatically generated.

        nandakumar131 Nanda kumar added a comment - edited

        Hi Ming Ma, I added logic to check for identical nodes in getWeightUsingNetworkLocation; it now returns 0 in the case of identical nodes.
        Please review HDFS-10206.003.patch.

        mingma Ming Ma added a comment -

        OK, maybe that isn't a precise way to refer to it. The "network path" comes from the NodeBase#getPath method. Anyway, the point is that the new method should return 0 for two identical nodes.

        nandakumar131 Nanda kumar added a comment -

        No, nodes with the same network path refers to nodes on the same rack. Node#getNetworkLocation() returns the path to the Node (the node name is not included in the path).

        For a node "/dc1/rack1/datanode1", Node#getNetworkLocation() will return "/dc1/rack1".
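
        To make the distinction concrete, a tiny standalone example (assuming hadoop-common on the classpath; this snippet is only an illustration, not part of any patch):

        NodePathExample.java
          import org.apache.hadoop.net.Node;
          import org.apache.hadoop.net.NodeBase;

          public class NodePathExample {
            public static void main(String[] args) {
              // The node's name is "datanode1"; its network location is the rack path only.
              Node dn = new NodeBase("/dc1/rack1/datanode1");
              System.out.println(dn.getNetworkLocation()); // /dc1/rack1
              System.out.println(NodeBase.getPath(dn));    // /dc1/rack1/datanode1
            }
          }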

        mingma Ming Ma added a comment -

        To clarify, "two nodes of the same network path" referred to two identical nodes, just like how getWeight can return 0 in that case.

        nandakumar131 Nanda kumar added a comment -

        If we return 0 for two nodes having the same network path (i.e. on the same rack) from getWeightUsingNetworkLocation:

        • getWeightUsingNetworkLocation will return 0 for same rack
        • getWeight will return 2 for same rack

        It would be good to have the same behavior across these methods.

        mingma Ming Ma added a comment -

        Thanks Nanda kumar! The patches look good overall. To make the method more general, it seems better to have getWeightUsingNetworkLocation return 0 when two nodes have the same network path. Daryn Sharp, Kihwal Lee, any concerns about the added 0.1ms latency? Note this only happens in the non-datanode reader scenario and it doesn't hold the FSNamesystem lock.

        nandakumar131 Nanda kumar added a comment -

        nonDataNodeReader. However, it turns out NetworkTopology has several existing references to "datanode". So it is good to have, and up to you if you want to fix it.

        I thought nonDataNodeReader is explicit and easy to understand, and as you mentioned there are several existing datanode references in NetworkTopology.

        So 001.patch shouldn't show a difference. Do you mind confirming?

        Below is the micro-benchmark for SameNode and SameRack with the HDFS-10206.002.patch

        Client on             | Run 1  | Run 2  | Run 3  | Run 4  | Run 5
        Same Node             | 126994 | 95364  | 140242 | 119920 | 113167
        DataNode in same rack | 91442  | 124531 | 102606 | 104960 | 142946

        Can you confirm with 0002.patch the weights? It seems to return 0, 2, 4. The old behavior is 0, 1, 2.

        Yes, after the patch NetworkTopology.getWeight will return 0, 2, 4 ...

        Below are a few cases:

        Same Node:
        /rack1/datanode1
        /rack1/datanode1
        Will return: 0

        Same Rack:
        /rack1/datanode1
        /rack1/datanode2
        Will return: 2

        Different Rack:
        /rack1/datanode1
        /rack2/datanode3
        Will return: 4

        /dc1/rack1/datanode1
        /default-rack/datanode4
        Will return: 5

        /dc1/rack1/datanode1
        /dc2/rack5/datanode5
        Will return: 6

        mingma Ming Ma added a comment -

        Can you point out the variables which are to be made more generic?

        nonDataNodeReader. However, it turns out NetworkTopology has several existing references to "datanode". So it is good to have, and up to you if you want to fix it.

        With 000.patch the weight is calculated using the network location for off-rack datanodes, which impacts the micro-benchmark results.

        Got it. Thanks for the clarification. So 001.patch shouldn't show a difference. Do you mind confirming?

        Weight calculation after this patch

        Can you confirm with 0002.patch the weights? It seems to return 0, 2, 4. The old behavior is 0, 1, 2.

        nandakumar131 Nanda kumar added a comment -

        NetworkTopology can be used by HDFS, YARN and MAPREDUCE. It is better to make variable names more general.

        Can you point out the variables which are to be made more generic?

        But the reader should pick the closest one, whether "Same Node" or "DataNode in same rack". Perhaps you can clarify the setup.

        Benchmarking was done with the default replication factor of 3 (two replicas will be in the same rack as the writer and one will be on a datanode in a different rack).
        The NetworkTopology.sortByDistance method calls NetworkTopology.getWeight for every replica of the block (within activeLen). Out of the three, at least one replica will be on an off-rack datanode even for "Same Node" and "DataNode in same rack". With 000.patch the weight is calculated using the network location for off-rack datanodes, which impacts the micro-benchmark results.
        Sorry if I have confused you more.

        So the weight value definition has changed. It should be fine given it isn't a public interface. Still NetworkTopologyWithNodeGroup has its own getWeight definition based on the old definition. Either we update that or keep the weight value.

        According to NetworkTopologyWithNodeGroup.getWeight

        0 for same node
        1 for same group
        2 for same rack
        3 for off rack

        It aligns with the weight definition of this patch, with an additional intermediate level (1 for same group):

        0 for same node
        2 for same rack

        For off-rack nodes, NetworkTopologyWithNodeGroup.getWeight can call super.getWeight, which will calculate the weight using the new logic rather than returning 3 for all off-rack nodes.
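
        A rough sketch of that delegation (an assumption for illustration only, not the committed change; it keeps the extra node-group level and defers everything farther away to the parent class):

        NodeGroupWeightSketch.java
          import org.apache.hadoop.net.NetworkTopologyWithNodeGroup;
          import org.apache.hadoop.net.Node;

          // Illustrative subclass; the real change would live in
          // NetworkTopologyWithNodeGroup#getWeight itself.
          public class NodeGroupWeightSketch extends NetworkTopologyWithNodeGroup {
            @Override
            protected int getWeight(Node reader, Node node) {
              if (reader != null) {
                if (reader.equals(node)) {
                  return 0;                              // same node
                } else if (isOnSameNodeGroup(reader, node)) {
                  return 1;                              // same node group
                } else if (isOnSameRack(reader, node)) {
                  return 2;                              // same rack
                }
              }
              // Off rack: reuse the level-based calculation in NetworkTopology
              // instead of returning a flat 3 for every off-rack node.
              return super.getWeight(reader, node);
            }
          }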

        mingma Ming Ma added a comment -
        • NetworkTopology can be used by HDFS, YARN and MAPREDUCE. It is better to make variable names more general.

        Out of the three replicas, one will be on an off-rack datanode, which is causing the difference.

        But the reader should pick the closest one, whether "Same Node" or "DataNode in same rack". Perhaps you can clarify the setup.

        Weight calculation after this patch

        So the weight value definition has changed. It should be fine given it isn't a public interface. Still NetworkTopologyWithNodeGroup has its own getWeight definition based on the old definition. Either we update that or keep the weight value.

        nandakumar131 Nanda kumar added a comment -

        Thanks for the comment, Ming Ma.

        Any idea why 000.patch makes a difference for "Same Node" and "DataNode in same rack"?

        Out of the three replicas, one will be on an off-rack datanode, which is causing the difference.

        Based on the comments, NetworkTopology.getWeightUsingNetworkLocation and NetworkTopology.normalizeNetworkLocationPath have been changed to static, and instead of calling NetworkTopology.getDistance from NetworkTopology.getWeight, logic is added in getWeight to calculate the weight, which also takes care of the isOnSameRack case.

        Weight calculation after this patch

        • 0 for same node
        • 2 for same rack
        • After that each level on each node increases the weight by 1

        Please review HDFS-10206.002.patch

        mingma Ming Ma added a comment -

        Thanks Nandakumar for the micro benchmark and the new patch.

        • Any idea why 000.patch makes a difference for "Same Node" and "DataNode in same rack"?
        • In the context of the overall data transfer duration, the overhead of 0.1ms looks acceptable, especially given DatanodeManager#sortLocatedBlocks doesn't take FSNamesystem's lock.
        • It seems getWeightUsingNetworkLocation and normalizeNetworkLocationPath can be static.
        • getWeight function calls getDistance, which returns the distance between two nodes, not the weight defined as the distance between nodes and ancestors. Maybe we can define a new function like getDistanceToClosestCommonAncestor, which can take care of the isOnSameRack case as well.
        • About ReadWriteLock.readLock, it might be ok given under normal workload there won't be much write to NetworkTopology.
        nandakumar131 Nanda kumar added a comment -

        New HDFS-10206.001.patch has been uploaded.

        The logic is modified based on Ming Ma's comment: DatanodeManager.sortLocatedBlock, which already knows whether the reader is a datanode or not, calls the appropriate method in NetworkTopology for sorting the blocks.

        Please review the patch.

        NetworkTopology.getWeight uses NetworkTopology.getDistance, which takes ReadWriteLock.readLock(); any thoughts on this?
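
        As context, a toy model of the dispatch this describes (not the real HDFS code; the name sortByDistanceUsingNetworkLocation for the non-datanode path is an assumption used only for illustration):

        SortDispatchSketch.java
          import java.util.Set;

          final class SortDispatchSketch {
            /** DatanodeManager#sortLocatedBlock already knows whether the reader is a
             *  registered datanode, so it can pick the cheaper tree-based sort when
             *  possible and only fall back to location-string comparison otherwise. */
            static String chooseSort(Set<String> registeredDatanodes, String readerHost) {
              return registeredDatanodes.contains(readerHost)
                  ? "sortByDistance (reader is in the topology tree)"
                  : "sortByDistanceUsingNetworkLocation (compare location strings)";
            }

            public static void main(String[] args) {
              Set<String> datanodes = Set.of("dn1.example.com", "dn2.example.com");
              System.out.println(chooseSort(datanodes, "dn1.example.com"));
              System.out.println(chooseSort(datanodes, "client-gw.example.com"));
            }
          }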

        nandakumar131 Nanda kumar added a comment -

        Here is another option. DatanodeManager#sortLocatedBlock already knows if it's a datanode. So we can have a new NetworkTopology#sortByDistance that supports check-by-reference.

        This option looks better, thanks for the suggestion Ming Ma. Will make the changes accordingly and upload a new patch.

        In the meantime I was able to do a micro benchmark with HDFS-10206.000.patch; the benchmark only measures the time taken by NetworkTopology.sortByDistance.

        HDFS Version: 2.7.1
        Reading a single block file. (file size: 120 MB)
        The values are in nanoseconds.

        Without patch

        Client on                      | Run 1  | Run 2  | Run 3  | Run 4  | Run 5
        Same Node                      | 99710  | 124014 | 134857 | 146936 | 111543
        DataNode in same rack          | 169122 | 99805  | 124058 | 134566 | 269096
        DataNode in different rack     | 114552 | 103003 | 153313 | 92008  | 114279
        Non-DataNode in same rack      | 97960  | 199611 | 77948  | 101324 | 90920
        Non-DataNode in different rack | 93002  | 182436 | 104600 | 96434  | 138167

        With patch

        Client on                      | Run 1  | Run 2  | Run 3  | Run 4  | Run 5
        Same Node                      | 121510 | 185741 | 110382 | 180451 | 132131
        DataNode in same rack          | 182892 | 128597 | 187518 | 136754 | 385739
        DataNode in different rack     | 201029 | 274671 | 298843 | 146709 | 154405
        Non-DataNode in same rack      | 92687  | 182100 | 134704 | 277057 | 207532
        Non-DataNode in different rack | 245957 | 115076 | 203657 | 181819 | 116314

        Below is the time taken by NetworkTopology.sortByDistance for a one GB file (eight blocks); the values are in nanoseconds.

        Without patch

        Client on                 | Run 1  | Run 2  | Run 3  | Run 4   | Run 5
        Non-DataNode in same rack | 244535 | 282273 | 216524 | 4410825 | 339375

        With patch

        Client on                 | Run 1  | Run 2   | Run 3  | Run 4  | Run 5
        Non-DataNode in same rack | 729701 | 5801405 | 613048 | 655345 | 506294
        mingma Ming Ma added a comment -

        Any comments on using NetworkTopology.contains(node) to check and NetworkTopology.getDistance(node1, node2) to get the distance in case the reader is an off-rack datanode?

        Here is another option. DatanodeManager#sortLocatedBlock already knows if it's a datanode. So we can have a new NetworkTopology#sortByDistance that supports check-by-reference.

        nandakumar131 Nanda kumar added a comment -

        From the below code, it seems each level will increase by 2.

        Yes, you're right. The distance from a node to its parent is assumed to be 1, so adding the values for both nodes makes it 2.
        Sorry for the confusion on that.

        One is the reader being a datanode in a remote rack in a large cluster; for that, NetworkTopology already has the reader in its tree, so it will be faster to compare parent references.

        We can use NetworkTopology.contains(node) to check if the reader is a datanode and use NetworkTopology.getDistance(node1, node2) to get the distance (which also calculates the distance by summing up the nodes' distances to their closest common ancestor), but both of these methods take ReadWriteLock.readLock(), which might again impact performance.

        Any comments on using NetworkTopology.contains(node) to check and NetworkTopology.getDistance(node1, node2) to get the distance in case the reader is an off-rack datanode?

        I'm currently working on the benchmarking and will update once it's done.
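
        For reference, a small standalone example of the contains/getDistance calls mentioned above (assumes hadoop-common on the classpath; the topology and node names are made up):

        DistanceLookupExample.java
          import org.apache.hadoop.net.NetworkTopology;
          import org.apache.hadoop.net.Node;
          import org.apache.hadoop.net.NodeBase;

          public class DistanceLookupExample {
            public static void main(String[] args) {
              NetworkTopology topology = new NetworkTopology();
              Node dn1 = new NodeBase("/dc1/rack1/datanode1");
              Node dn2 = new NodeBase("/dc1/rack2/datanode2");
              topology.add(dn1);
              topology.add(dn2);
              // Both calls take the topology's read lock internally.
              System.out.println(topology.contains(dn1));         // true
              System.out.println(topology.getDistance(dn1, dn2)); // 4 (up to /dc1 and back down)
            }
          }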

        mingma Ming Ma added a comment -

        that is why getDistanceUsingNetworkLocation is called only when the conditions reader.equals(node) and isOnSameRack(reader, node) are not satisfied.

        There are two scenarios where this new function will be called. One is the reader being a datanode in a remote rack in a large cluster; for that, NetworkTopology already has the reader in its tree, so it will be faster to compare parent references. The other is the reader being a non-datanode; the new function will be useful there. Do you have any micro benchmark?

        With this patch it will be 0 for local, 1 for same rack and after that the value is incremented by 1 for each level.

        From the below code, it seems each level will increase by 2.

              weight = (path1Token.length - currentLevel) +
                  (path2Token.length - currentLevel);
        
        nandakumar131 Nanda kumar added a comment -

        Thanks for the review Ming Ma.

        When the conditions reader.equals(node) & isOnSameRack(reader, node) aren't satisfied, this patch will cause extra string parsing. Wonder if there is any major performance impact. If that isn't an issue, can getDistanceUsingNetworkLocation handle all scenarios including reader.equals(node) & isOnSameRack(reader, node)?

        I was also worried about the performance impact of the extra string parsing; that is why getDistanceUsingNetworkLocation is called only when the conditions reader.equals(node) and isOnSameRack(reader, node) are not satisfied.

        It probably doesn't matter much. getWeight used to return 0, 1, 2, 3, etc. as the network layer increases. With the patch it changes to 0, 1, 2, 4, etc.

        I didn't quite understand this point. Previously getWeight used to return 0 for local, 1 for same rack and 2 for off rack. With this patch it will be 0 for local, 1 for same rack and after that the value is incremented by 1 for each level.

        mingma Ming Ma added a comment -

        Thanks Nanda kumar!

        • When the conditions reader.equals(node) & isOnSameRack(reader, node) aren't satisfied, this patch will cause extra string parsing. Wonder if there is any major performance impact. If that isn't an issue, can getDistanceUsingNetworkLocation handle all scenarios including reader.equals(node) & isOnSameRack(reader, node)?
        • It probably doesn't matter much. getWeight used to return 0, 1, 2, 3, etc. as the network layer increases. With the patch it changes to 0, 1, 2, 4, etc.
        nandakumar131 Nanda kumar added a comment -

        NetworkTopology#sortByDistance uses NetworkTopology#getWeight to calculate the distance between the reader and a node. Additional logic is added in NetworkTopology#getWeight to calculate the distance based on the networkLocation of the reader and the node when the following conditions are not satisfied:

        reader.equals(node) & isOnSameRack(reader, node)

        This will work for a DFSClient machine which is not a datanode, since the distance calculation depends on the networkLocation and not the parent Node.

        Please review the patch.

        Thanks,
        Nanda


          People

          • Assignee:
            nandakumar131 Nanda kumar
            Reporter:
            mingma Ming Ma
          • Votes:
            0
            Watchers:
            11

            Dates

            • Created:
              Updated:
              Resolved:

              Development