Hadoop HDFS / HDFS-10335

Mover$Processor#chooseTarget() always chooses the first matching target storage group

    Details

    • Target Version/s:
    • Hadoop Flags:
      Reviewed

      Description

      Currently, org.apache.hadoop.hdfs.server.mover.Mover$Processor#chooseTarget() always chooses the first matching target datanode from the candidate list. This may make the Mover schedule many tasks on only a few datanodes (the first several in the candidate list). Overall performance suffers significantly because network and disk usage on those datanodes becomes saturated. In particular, if dfs.datanode.balance.max.concurrent.moves is set, the scheduled move tasks are queued on a few storage groups, regardless of other available storage groups. We need an algorithm that distributes the move tasks approximately evenly across all candidate target storage groups.

      Thanks to Tsz Wo Nicholas Sze for the offline discussion.
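
      To make the skew described above concrete, here is a minimal sketch in plain Java. It is a toy model under stated assumptions, not the HDFS code: the class FirstMatchSkewDemo, its methods, and the cap of 5 concurrent moves per target are hypothetical names and values chosen for illustration, and the model ignores the case where a target refuses a pending move.

```java
import java.util.Arrays;
import java.util.function.IntPredicate;

/** Hypothetical toy model of first-match scheduling; not the HDFS Mover classes. */
class FirstMatchSkewDemo {

  /**
   * Assigns each move to the first candidate index accepted by the matcher
   * and returns how many moves each candidate received.
   */
  static int[] scheduleFirstMatch(int numTargets, int numMoves, IntPredicate matches) {
    int[] assigned = new int[numTargets];
    for (int m = 0; m < numMoves; m++) {
      for (int t = 0; t < numTargets; t++) {
        if (matches.test(t)) {
          assigned[t]++;
          break; // first match wins, so later candidates are never reached
        }
      }
    }
    return assigned;
  }

  /** Moves left waiting when a target runs at most maxConcurrent moves at once. */
  static int queued(int assigned, int maxConcurrent) {
    return Math.max(0, assigned - maxConcurrent);
  }

  public static void main(String[] args) {
    // Four candidates that all match, 100 moves to schedule.
    int[] assigned = scheduleFirstMatch(4, 100, t -> true);
    System.out.println(Arrays.toString(assigned)); // prints [100, 0, 0, 0]
    // With an assumed cap of 5 concurrent moves, 95 moves wait on one target.
    System.out.println(queued(assigned[0], 5)); // prints 95
  }
}
```

      In this model the other three targets stay idle while the first one holds the whole queue, which is the imbalance the description refers to.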

      1. HDFS-10335.000.patch
        1 kB
        Mingliang Liu
      2. HDFS-10335.000.patch
        1 kB
        Mingliang Liu

        Activity

        liuml07 Mingliang Liu added a comment -

        The current code is as follows:

            // Current first-match behavior: for each storage type, the loop
            // returns on the first target that matches and accepts the move.
            boolean chooseTarget(DBlock db, Source source,
                List<StorageType> targetTypes, Matcher matcher) {
              final NetworkTopology cluster = dispatcher.getCluster();
              for (StorageType t : targetTypes) {
                for (StorageGroup target : storages.getTargetStorages(t)) {
                  if (matcher.match(cluster, source.getDatanodeInfo(),
                      target.getDatanodeInfo())) {
                    final PendingMove pm = source.addPendingMove(db, target);
                    if (pm != null) {
                      dispatcher.executePendingMove(pm);
                      return true;
                    }
                  }
                }
              }
              return false;
            }
        

        To address this, we can pick a random matching storage group for the given storage type. One implementation is to shuffle the candidate target storages before iterating over them. I will post a patch shortly.
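
        The shuffle idea can be sketched as follows. This is a hedged illustration built on java.util.Collections.shuffle, not the committed patch: ShuffledTargetChooser and its methods are hypothetical, plain strings stand in for storage groups, and a Predicate stands in for the Matcher.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Random;
import java.util.function.Predicate;

/** Hypothetical stand-in for target selection; not the HDFS Mover classes. */
class ShuffledTargetChooser {

  /** First-match selection: returns the earliest matching candidate, or null. */
  static <T> T chooseFirst(List<T> candidates, Predicate<T> matcher) {
    for (T candidate : candidates) {
      if (matcher.test(candidate)) {
        return candidate;
      }
    }
    return null;
  }

  /** Shuffles a copy of the candidates, then applies first-match selection. */
  static <T> T chooseShuffled(List<T> candidates, Predicate<T> matcher, Random rng) {
    List<T> copy = new ArrayList<>(candidates); // leave the caller's list untouched
    Collections.shuffle(copy, rng);
    return chooseFirst(copy, matcher);
  }

  /** Counts how often each target is picked over the given number of rounds. */
  static Map<String, Integer> countPicks(List<String> targets, int rounds, Random rng) {
    Map<String, Integer> counts = new HashMap<>();
    for (int i = 0; i < rounds; i++) {
      String picked = chooseShuffled(targets, t -> true, rng);
      counts.merge(picked, 1, Integer::sum);
    }
    return counts;
  }

  public static void main(String[] args) {
    List<String> targets = Arrays.asList("dn1", "dn2", "dn3", "dn4");
    // Each target should receive roughly a quarter of the picks.
    System.out.println(countPicks(targets, 10000, new Random(42)));
  }
}
```

        Shuffling a copy keeps the rest of the loop unchanged: the first match still wins, but which candidate comes first now varies from call to call, so picks spread across all matching targets.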

        hadoopqa Hadoop QA added a comment -
        -1 overall



        Vote  Subsystem  Runtime  Comment
        0     reexec     0m 0s    Docker mode activated.
        -1    docker     0m 3s    Docker failed to build yetus/hadoop:7b1c37a.



        Subsystem Report/Notes
        JIRA Patch URL https://issues.apache.org/jira/secure/attachment/12800935/HDFS-10335.000.patch
        JIRA Issue HDFS-10335
        Console output https://builds.apache.org/job/PreCommit-HDFS-Build/15309/console
        Powered by Apache Yetus 0.2.0 http://yetus.apache.org

        This message was automatically generated.

        szetszwo Tsz Wo Nicholas Sze added a comment -

        +1 patch looks good.

        hadoopqa Hadoop QA added a comment -
        -1 overall



        Vote  Subsystem  Runtime  Comment
        0     reexec     0m 0s    Docker mode activated.
        -1    docker     0m 4s    Docker failed to build yetus/hadoop:7b1c37a.



        Subsystem Report/Notes
        JIRA Patch URL https://issues.apache.org/jira/secure/attachment/12800935/HDFS-10335.000.patch
        JIRA Issue HDFS-10335
        Console output https://builds.apache.org/job/PreCommit-HDFS-Build/15310/console
        Powered by Apache Yetus 0.2.0 http://yetus.apache.org

        This message was automatically generated.

        liuml07 Mingliang Liu added a comment -
        Step 16 : RUN cabal update && cabal install shellcheck --global
         ---> Running in 5438b8eb4d37
        Config file path source is default config file.
        Config file /root/.cabal/config not found.
        Writing default configuration to /root/.cabal/config
        Downloading the latest package list from hackage.haskell.org
        cabal: Failed to download
        http://hackage.haskell.org/packages/archive/00-index.tar.gz : ErrorMisc
        "Unsucessful HTTP code: 502"
        The command '/bin/sh -c cabal update && cabal install shellcheck --global' returned a non-zero code: 1
        
        Total Elapsed time:   0m  4s
        
        ERROR: Docker failed to build image.
        

        It seems Yetus is the unhappy one here, not Jenkins: the Docker image build failed while downloading the package list from hackage.haskell.org.

        liuml07 Mingliang Liu added a comment -

        Thanks for your discussion and review, Tsz Wo Nicholas Sze.

        By the way, we did not see any failing unit tests locally, so let's wait for Jenkins to verify. Meanwhile, we're testing the patch manually on a local cluster.

        liuml07 Mingliang Liu added a comment -

        Re-uploading the same patch to trigger Jenkins.

        hadoopqa Hadoop QA added a comment -
        -1 overall



        Vote  Subsystem   Runtime   Comment
        0     reexec      0m 12s    Docker mode activated.
        +1    @author     0m 0s     The patch does not contain any @author tags.
        -1    test4tests  0m 0s     The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch.
        +1    mvninstall  6m 42s    trunk passed
        +1    compile     0m 38s    trunk passed with JDK v1.8.0_92
        +1    compile     0m 41s    trunk passed with JDK v1.7.0_95
        +1    checkstyle  0m 21s    trunk passed
        +1    mvnsite     0m 49s    trunk passed
        +1    mvneclipse  0m 13s    trunk passed
        +1    findbugs    1m 59s    trunk passed
        +1    javadoc     1m 5s     trunk passed with JDK v1.8.0_92
        +1    javadoc     1m 46s    trunk passed with JDK v1.7.0_95
        +1    mvninstall  0m 45s    the patch passed
        +1    compile     0m 38s    the patch passed with JDK v1.8.0_92
        +1    javac       0m 38s    the patch passed
        +1    compile     0m 39s    the patch passed with JDK v1.7.0_95
        +1    javac       0m 39s    the patch passed
        +1    checkstyle  0m 18s    the patch passed
        +1    mvnsite     0m 48s    the patch passed
        +1    mvneclipse  0m 11s    the patch passed
        +1    whitespace  0m 0s     Patch has no whitespace issues.
        +1    findbugs    2m 8s     the patch passed
        +1    javadoc     1m 1s     the patch passed with JDK v1.8.0_92
        +1    javadoc     1m 46s    the patch passed with JDK v1.7.0_95
        -1    unit        59m 22s   hadoop-hdfs in the patch failed with JDK v1.8.0_92.
        -1    unit        53m 56s   hadoop-hdfs in the patch failed with JDK v1.7.0_95.
        +1    asflicense  0m 23s    Patch does not generate ASF License warnings.
                          138m 20s



        Reason                            Tests
        JDK v1.8.0_92 Failed junit tests  hadoop.hdfs.TestFileAppend
                                          hadoop.hdfs.server.balancer.TestBalancer
                                          hadoop.hdfs.TestCrcCorruption
                                          hadoop.hdfs.shortcircuit.TestShortCircuitCache
        JDK v1.7.0_95 Failed junit tests  hadoop.hdfs.server.datanode.fsdataset.impl.TestFsDatasetImpl
                                          hadoop.hdfs.TestRollingUpgradeRollback



        Subsystem Report/Notes
        Docker Image:yetus/hadoop:cf2ee45
        JIRA Patch URL https://issues.apache.org/jira/secure/attachment/12801344/HDFS-10335.000.patch
        JIRA Issue HDFS-10335
        Optional Tests asflicense compile javac javadoc mvninstall mvnsite unit findbugs checkstyle
        uname Linux 9269c005acf8 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux
        Build tool maven
        Personality /testptch/hadoop/patchprocess/precommit/personality/provided.sh
        git revision trunk / 6243eab
        Default Java 1.7.0_95
        Multi-JDK versions /usr/lib/jvm/java-8-oracle:1.8.0_92 /usr/lib/jvm/java-7-openjdk-amd64:1.7.0_95
        findbugs v3.0.0
        unit https://builds.apache.org/job/PreCommit-HDFS-Build/15319/artifact/patchprocess/patch-unit-hadoop-hdfs-project_hadoop-hdfs-jdk1.8.0_92.txt
        unit https://builds.apache.org/job/PreCommit-HDFS-Build/15319/artifact/patchprocess/patch-unit-hadoop-hdfs-project_hadoop-hdfs-jdk1.7.0_95.txt
        unit test logs https://builds.apache.org/job/PreCommit-HDFS-Build/15319/artifact/patchprocess/patch-unit-hadoop-hdfs-project_hadoop-hdfs-jdk1.8.0_92.txt https://builds.apache.org/job/PreCommit-HDFS-Build/15319/artifact/patchprocess/patch-unit-hadoop-hdfs-project_hadoop-hdfs-jdk1.7.0_95.txt
        JDK v1.7.0_95 Test Results https://builds.apache.org/job/PreCommit-HDFS-Build/15319/testReport/
        modules C: hadoop-hdfs-project/hadoop-hdfs U: hadoop-hdfs-project/hadoop-hdfs
        Console output https://builds.apache.org/job/PreCommit-HDFS-Build/15319/console
        Powered by Apache Yetus 0.2.0 http://yetus.apache.org

        This message was automatically generated.

        liuml07 Mingliang Liu added a comment -

        The failing tests are not related. Specifically, hadoop.hdfs.TestRollingUpgradeRollback fails because a port is in use, and hadoop.hdfs.server.datanode.fsdataset.impl.TestFsDatasetImpl is a known bug tracked by HDFS-10260.

        We did not add a new test because the code path is covered by existing tests. We manually tested the patch and the Mover was ~60X faster than before, though this is not the general case, as all ARCHIVE datanodes in the test cluster were newly added to the same rack.

        szetszwo Tsz Wo Nicholas Sze added a comment -

        I have committed this. Thanks, Mingliang!

        liuml07 Mingliang Liu added a comment -

        Thanks for your review and commit, Tsz Wo Nicholas Sze.

        hudson Hudson added a comment -

        FAILURE: Integrated in Hadoop-trunk-Commit #9695 (See https://builds.apache.org/job/Hadoop-trunk-Commit/9695/)
        HDFS-10335 Mover$Processor#chooseTarget() always chooses the first (szetszwo: rev 4da6f69ca129b28a5dad0a66d0c24e725ce25a3a)

        • hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/mover/Mover.java
        vinodkv Vinod Kumar Vavilapalli added a comment -

        Closing the JIRA as part of 2.7.3 release.


          People

          • Assignee:
            liuml07 Mingliang Liu
            Reporter:
            liuml07 Mingliang Liu
          • Votes:
            0
            Watchers:
            9
