YARN-5008: LeveldbRMStateStore database can grow substantially leading to long recovery times

    Details

    • Hadoop Flags:
      Reviewed

      Description

      On large clusters with high application churn, the background compaction in leveldb may not be able to keep up with the write rate. This can lead to large leveldb databases that take many minutes to recover even though they contain very little live data. Most of the time is spent traversing tables full of keys that have been deleted.
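
      To illustrate why deleted keys slow recovery, here is a toy model in Java. It is an assumption-laden sketch, not leveldb's actual on-disk format: `TombstoneDemo` and all of its methods are hypothetical names. Deletes append a marker instead of erasing data, so a scan still reads every record until a compaction rewrites the log.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

/** Toy append-only log. Hypothetical model only; NOT leveldb's real format. */
class TombstoneDemo {
  /** One record in the log: a put, or a delete marker (tombstone). */
  static final class Entry {
    final String key;
    final boolean tombstone;
    Entry(String key, boolean tombstone) {
      this.key = key;
      this.tombstone = tombstone;
    }
  }

  private final List<Entry> log = new ArrayList<>();

  void put(String key) { log.add(new Entry(key, false)); }

  /** A delete does not shrink the log; it appends a tombstone. */
  void delete(String key) { log.add(new Entry(key, true)); }

  /** Number of records a full scan has to read, live or not. */
  int entriesScanned() { return log.size(); }

  /** Keys that are still live after replaying the log in order. */
  Set<String> liveKeys() {
    Set<String> live = new LinkedHashSet<>();
    for (Entry e : log) {
      if (e.tombstone) {
        live.remove(e.key);
      } else {
        live.add(e.key);
      }
    }
    return live;
  }

  /** Rewrite the log so it holds only live keys, as a full compaction would. */
  void compact() {
    Set<String> live = liveKeys();
    log.clear();
    for (String k : live) {
      log.add(new Entry(k, false));
    }
  }

  public static void main(String[] args) {
    TombstoneDemo db = new TombstoneDemo();
    // High churn: each new application soon replaces an older one.
    for (int i = 0; i < 100000; i++) {
      db.put("app_" + i);
      if (i >= 10) {
        db.delete("app_" + (i - 10));
      }
    }
    System.out.println("live keys: " + db.liveKeys().size());
    System.out.println("records scanned before compaction: " + db.entriesScanned());
    db.compact();
    System.out.println("records scanned after compaction: " + db.entriesScanned());
  }
}
```

      In this model the scan cost tracks the total number of records written, not the number of live keys, which matches the symptom described above.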

        Activity

        jlowe Jason Lowe added a comment -

        I noticed that in the cases where the database was quite large a manual compaction of the database would shrink it from many gigabytes to under 100MB. It looks like we need periodic manual compactions of the database to keep the leveldb tables from getting filled up with stale keys. Once the database fills with mostly stale keys the recovery process becomes quite slow due to all the I/O required to iterate the few valid keys remaining.

        Attaching a patch that adds a periodic full compaction of the database. By default it runs every hour, but the interval can be configured, or the compaction disabled entirely if desired. I did some tests on a very large database, writing keys every 10 ms while a full compaction cycle was running, and the impact on write performance was acceptable. Writes were occasionally delayed by up to 30% due to disk I/O contention, but overall the write performance was still quite good. If the database is already mostly compact the cycle runs very fast, so this should have minimal impact on the overall RM state store performance.
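
        The periodic-compaction idea can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the attached patch: `PeriodicCompactor`, `Compactable`, and `compactAll` are hypothetical names, and the interval argument stands in for the configurable interval mentioned above. A non-positive interval is treated as "disabled".

```java
import java.util.Timer;
import java.util.TimerTask;

/** Hypothetical sketch of a periodic full compaction; not the actual patch. */
class PeriodicCompactor {
  /**
   * Stand-in for the state store. With a leveldb Java binding this would
   * presumably wrap a full-range compaction call (assumption, not verified here).
   */
  interface Compactable {
    void compactAll() throws Exception;
  }

  // Daemon timer so a forgotten stop() does not keep the JVM alive.
  private final Timer timer = new Timer("compaction-timer", true);

  /** Build the task that runs one compaction and logs, rather than rethrows, failures. */
  static TimerTask compactionTask(final Compactable db) {
    return new TimerTask() {
      @Override
      public void run() {
        long start = System.currentTimeMillis();
        try {
          db.compactAll();
          System.out.println("full compaction took "
              + (System.currentTimeMillis() - start) + " ms");
        } catch (Exception e) {
          // A failed cycle should not kill the timer thread; try again next interval.
          System.err.println("compaction failed: " + e);
        }
      }
    };
  }

  /**
   * Schedule repeated compactions. An interval of zero or less disables them.
   * Returns true if a task was scheduled.
   */
  boolean start(Compactable db, long intervalMs) {
    if (intervalMs <= 0) {
      return false;
    }
    timer.schedule(compactionTask(db), intervalMs, intervalMs);
    return true;
  }

  void stop() {
    timer.cancel();
  }

  public static void main(String[] args) throws InterruptedException {
    final int[] runs = {0};
    PeriodicCompactor compactor = new PeriodicCompactor();
    // Short interval for demonstration; the comment above describes an hourly default.
    compactor.start(() -> runs[0]++, 50);
    Thread.sleep(300);
    compactor.stop();
    System.out.println("compaction cycles run: " + runs[0]);
  }
}
```

        Running the compaction on its own timer thread keeps it off the write path, so writers only see it as disk I/O contention rather than being blocked outright.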

        nroberts Nathan Roberts added a comment -

        Thanks for the patch. LGTM. +1 non-binding

        hadoopqa Hadoop QA added a comment -
        -1 overall



        Vote Subsystem Runtime Comment
        0 reexec 0m 16s Docker mode activated.
        +1 @author 0m 0s The patch does not contain any @author tags.
        +1 test4tests 0m 0s The patch appears to include 1 new or modified test files.
        0 mvndep 0m 11s Maven dependency ordering for branch
        +1 mvninstall 6m 43s trunk passed
        +1 compile 1m 44s trunk passed with JDK v1.8.0_92
        +1 compile 2m 8s trunk passed with JDK v1.7.0_95
        +1 checkstyle 0m 34s trunk passed
        +1 mvnsite 1m 33s trunk passed
        +1 mvneclipse 0m 39s trunk passed
        -1 findbugs 1m 8s hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common in trunk has 1 extant Findbugs warnings.
        +1 javadoc 1m 25s trunk passed with JDK v1.8.0_92
        +1 javadoc 3m 55s trunk passed with JDK v1.7.0_95
        0 mvndep 0m 11s Maven dependency ordering for patch
        +1 mvninstall 1m 19s the patch passed
        +1 compile 1m 44s the patch passed with JDK v1.8.0_92
        +1 javac 1m 44s the patch passed
        +1 compile 2m 6s the patch passed with JDK v1.7.0_95
        +1 javac 2m 6s the patch passed
        +1 checkstyle 0m 33s the patch passed
        +1 mvnsite 1m 28s the patch passed
        +1 mvneclipse 0m 36s the patch passed
        +1 whitespace 0m 0s Patch has no whitespace issues.
        +1 xml 0m 1s The patch has no ill-formed XML file.
        +1 findbugs 3m 56s the patch passed
        +1 javadoc 1m 20s the patch passed with JDK v1.8.0_92
        +1 javadoc 3m 53s the patch passed with JDK v1.7.0_95
        +1 unit 0m 19s hadoop-yarn-api in the patch passed with JDK v1.8.0_92.
        +1 unit 1m 58s hadoop-yarn-common in the patch passed with JDK v1.8.0_92.
        -1 unit 48m 33s hadoop-yarn-server-resourcemanager in the patch failed with JDK v1.8.0_92.
        +1 unit 0m 22s hadoop-yarn-api in the patch passed with JDK v1.7.0_95.
        +1 unit 2m 12s hadoop-yarn-common in the patch passed with JDK v1.7.0_95.
        -1 unit 49m 17s hadoop-yarn-server-resourcemanager in the patch failed with JDK v1.7.0_95.
        +1 asflicense 0m 17s Patch does not generate ASF License warnings.
        144m 0s



        Reason Tests
        JDK v1.8.0_92 Failed junit tests hadoop.yarn.server.resourcemanager.TestClientRMTokens
          hadoop.yarn.server.resourcemanager.TestAMAuthorization
        JDK v1.8.0_92 Timed out junit tests org.apache.hadoop.yarn.server.resourcemanager.webapp.TestRMWebServicesNodes
        JDK v1.7.0_95 Failed junit tests hadoop.yarn.server.resourcemanager.TestClientRMTokens
          hadoop.yarn.server.resourcemanager.TestAMAuthorization
          hadoop.yarn.server.resourcemanager.TestContainerResourceUsage
        JDK v1.7.0_95 Timed out junit tests org.apache.hadoop.yarn.server.resourcemanager.webapp.TestRMWebServicesNodes



        Subsystem Report/Notes
        Docker Image:yetus/hadoop:7b1c37a
        JIRA Patch URL https://issues.apache.org/jira/secure/attachment/12801248/YARN-5008.001.patch
        JIRA Issue YARN-5008
        Optional Tests asflicense compile javac javadoc mvninstall mvnsite unit findbugs checkstyle xml
        uname Linux adff080847ba 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux
        Build tool maven
        Personality /testptch/hadoop/patchprocess/precommit/personality/provided.sh
        git revision trunk / 6f26b66
        Default Java 1.7.0_95
        Multi-JDK versions /usr/lib/jvm/java-8-oracle:1.8.0_92 /usr/lib/jvm/java-7-openjdk-amd64:1.7.0_95
        findbugs v3.0.0
        findbugs https://builds.apache.org/job/PreCommit-YARN-Build/11261/artifact/patchprocess/branch-findbugs-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-common-warnings.html
        unit https://builds.apache.org/job/PreCommit-YARN-Build/11261/artifact/patchprocess/patch-unit-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager-jdk1.8.0_92.txt
        unit https://builds.apache.org/job/PreCommit-YARN-Build/11261/artifact/patchprocess/patch-unit-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager-jdk1.7.0_95.txt
        unit test logs https://builds.apache.org/job/PreCommit-YARN-Build/11261/artifact/patchprocess/patch-unit-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager-jdk1.8.0_92.txt https://builds.apache.org/job/PreCommit-YARN-Build/11261/artifact/patchprocess/patch-unit-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager-jdk1.7.0_95.txt
        JDK v1.7.0_95 Test Results https://builds.apache.org/job/PreCommit-YARN-Build/11261/testReport/
        modules C: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager U: hadoop-yarn-project/hadoop-yarn
        Console output https://builds.apache.org/job/PreCommit-YARN-Build/11261/console
        Powered by Apache Yetus 0.2.0 http://yetus.apache.org

        This message was automatically generated.

        jianhe Jian He added a comment -

        lgtm, thanks Jason, will commit later today

        hudson Hudson added a comment -

        FAILURE: Integrated in Hadoop-trunk-Commit #9689 (See https://builds.apache.org/job/Hadoop-trunk-Commit/9689/)
        YARN-5008. LeveldbRMStateStore database can grow substantially leading (jianhe: rev dd80042c42aadaa347db93028724f69c9aca69c6)

        • hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/resources/yarn-default.xml
        • hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/recovery/TestLeveldbRMStateStore.java
        • hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/recovery/LeveldbRMStateStore.java
        • hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api/src/main/java/org/apache/hadoop/yarn/conf/YarnConfiguration.java
        jianhe Jian He added a comment -

        Committed to trunk, branch-2, branch-2.8, and branch-2.7. Thanks, Jason!
        Thanks to Nathan Roberts for the review!

        kasha Karthik Kambatla added a comment -

        By the way, this also brings up another issue with our current RMStateStore implementations. We don't need to retain all the information for completed applications; for instance, the ASC (ApplicationSubmissionContext) is moot once an application has completed, yet it contributes significantly to the state-store footprint. This affects all stores, and the ZK store in particular.

        eepayne Eric Payne added a comment -

        Changing Fix Version to 2.7.3 since branch 2.7.3 has not yet been created.

        vinodkv Vinod Kumar Vavilapalli added a comment -

        Closing the JIRA as part of 2.7.3 release.


          People

          • Assignee:
            jlowe Jason Lowe
            Reporter:
            jlowe Jason Lowe
          • Votes:
            0
            Watchers:
            13
