Hadoop YARN / YARN-6354

LeveldbRMStateStore can parse invalid keys when recovering reservations

      Description

      Upgrading an RM to 2.8 fails with a StringIndexOutOfBoundsException while the RM is loading reservation state.

          Activity

          jlowe Jason Lowe added a comment -

          Sample stacktrace:

          2017-03-16 15:17:26,616 INFO  [main] service.AbstractService (AbstractService.java:noteFailure(272)) - Service ResourceManager failed in state STARTED; cause: java.lang.StringIndexOutOfBoundsException: String index out of range: -17
          java.lang.StringIndexOutOfBoundsException: String index out of range: -17
          	at java.lang.String.substring(String.java:1931)
          	at org.apache.hadoop.yarn.server.resourcemanager.recovery.LeveldbRMStateStore.loadReservationState(LeveldbRMStateStore.java:289)
          	at org.apache.hadoop.yarn.server.resourcemanager.recovery.LeveldbRMStateStore.loadState(LeveldbRMStateStore.java:274)
          	at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMActiveServices.serviceStart(ResourceManager.java:690)
          	at org.apache.hadoop.service.AbstractService.start(AbstractService.java:193)
          	at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.startActiveServices(ResourceManager.java:1097)
          	at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$1.run(ResourceManager.java:1137)
          	at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$1.run(ResourceManager.java:1133)
          	at java.security.AccessController.doPrivileged(Native Method)
          	at javax.security.auth.Subject.doAs(Subject.java:422)
          	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1936)
          	at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.transitionToActive(ResourceManager.java:1133)
          	at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.serviceStart(ResourceManager.java:1173)
          	at org.apache.hadoop.service.AbstractService.start(AbstractService.java:193)
          	at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.main(ResourceManager.java:1338)
          

          This was broken by YARN-3736. The recovery code is seeking to the RM_RESERVATION_KEY_PREFIX but failing to verify that the keys it sees in the loop actually have that key prefix. Here's the relevant code:

                iter = new LeveldbIterator(db);
                iter.seek(bytes(RM_RESERVATION_KEY_PREFIX));
                while (iter.hasNext()) {
                  Entry<byte[],byte[]> entry = iter.next();
                  String key = asString(entry.getKey());
          
                  String planReservationString =
                      key.substring(RM_RESERVATION_KEY_PREFIX.length());
                  String[] parts = planReservationString.split(SEPARATOR);
                  if (parts.length != 2) {
                    LOG.warn("Incorrect reservation state key " + key);
                    continue;
                  }
          

          The loop terminates only when the iterator runs out of keys, so it scans every key in the database from the first reservation key to the end. If any key encountered is shorter than the prefix, the substring call throws the out-of-bounds exception.
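One way to bound the scan described above is to stop as soon as a key no longer starts with the prefix. The following is a minimal, self-contained sketch of that idea, not the committed patch: a `TreeMap` stands in for the sorted leveldb keyspace, `tailMap` stands in for `iter.seek`, and the constant values and the class name `ReservationKeyScan` are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class ReservationKeyScan {
  // Hypothetical values for illustration; the real constants live in LeveldbRMStateStore.
  static final String SEPARATOR = "/";
  static final String RM_RESERVATION_KEY_PREFIX = "ReservationSystemRoot" + SEPARATOR;

  /** Returns the "plan/reservation" suffix of every well-formed reservation key. */
  static List<String> loadReservationKeys(TreeMap<String, byte[]> db) {
    List<String> found = new ArrayList<>();
    // tailMap(prefix, true) plays the role of iter.seek(prefix): start at the first key >= prefix.
    for (Map.Entry<String, byte[]> entry : db.tailMap(RM_RESERVATION_KEY_PREFIX, true).entrySet()) {
      String key = entry.getKey();
      // Termination check: keys are sorted, so the first key without the prefix ends the range.
      if (!key.startsWith(RM_RESERVATION_KEY_PREFIX)) {
        break;
      }
      String planReservationString = key.substring(RM_RESERVATION_KEY_PREFIX.length());
      String[] parts = planReservationString.split(SEPARATOR);
      if (parts.length != 2) {
        System.err.println("Incorrect reservation state key " + key);
        continue;
      }
      found.add(planReservationString);
    }
    return found;
  }

  public static void main(String[] args) {
    TreeMap<String, byte[]> db = new TreeMap<>();
    db.put("RMAppRoot/application_1", new byte[0]);
    db.put("ReservationSystemRoot/planA/reservation_1", new byte[0]);
    db.put("ReservationSystemRoot/planA/reservation_2", new byte[0]);
    db.put("short", new byte[0]); // sorts after the reservation keys
    System.out.println(loadReservationKeys(db));
  }
}
```

Because the keys are sorted, the first key that lacks the prefix marks the end of the reservation range, so breaking there also avoids scanning the rest of the database.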

          Pinging Anubhav Dhoot and Arun Suresh who were involved in YARN-3736.

          jlowe Jason Lowe added a comment -

          I found another instance where a rolling upgrade to 2.8 with leveldb did work successfully, so I dug a bit deeper into why this doesn't always fail. It turns out that normally the reservation state keys happen to be the last keys in the database and therefore it works. If the database happens to have any relatively short keys after the reservation keys then it breaks. My local dev database had some short, lowercase keys leftover in it from some prior work, and that's how I ran into the issue.

          Since this does not appear to be a problem for "normal" RM leveldb databases for now, I lowered the severity and updated the headline accordingly.
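The dependence on key ordering can be reproduced without leveldb. The sketch below mimics the unguarded loop over a sorted `TreeMap` and reports whether the scan completes. The prefix value, the sample keys, and the class name `UnguardedScanDemo` are assumptions for illustration only.

```java
import java.util.TreeMap;

class UnguardedScanDemo {
  // Hypothetical prefix value for illustration.
  static final String PREFIX = "ReservationSystemRoot/";

  /** Mimics the unguarded loop: true if the scan completes, false if substring throws. */
  static boolean scanSucceeds(TreeMap<String, String> db) {
    try {
      // Visit every key >= PREFIX in sorted order, with no prefix check.
      for (String key : db.tailMap(PREFIX, true).keySet()) {
        key.substring(PREFIX.length()); // throws if the key is shorter than the prefix
      }
      return true;
    } catch (StringIndexOutOfBoundsException e) {
      return false;
    }
  }

  public static void main(String[] args) {
    TreeMap<String, String> db = new TreeMap<>();
    db.put("RMAppRoot/application_1", "");
    db.put("ReservationSystemRoot/planA/reservation_1", "");
    // Reservation keys are the last keys, so the scan never sees a foreign key.
    System.out.println("reservation keys last: " + scanSucceeds(db));

    db.put("short", ""); // lowercase key sorts after the uppercase-led reservation keys
    System.out.println("short key afterwards:  " + scanSucceeds(db));
  }
}
```

A key that sorts after the prefix and is longer than it would not throw in this loop, but it would still be parsed as if it were a reservation key.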

          jlowe Jason Lowe added a comment -

          Patch that adds a termination check for the reservation key traversal loop and a unit test.

          hadoopqa Hadoop QA added a comment -
          -1 overall



          | Vote | Subsystem | Runtime | Comment |
          | 0 | reexec | 0m 19s | Docker mode activated. |
          | +1 | @author | 0m 0s | The patch does not contain any @author tags. |
          | +1 | test4tests | 0m 0s | The patch appears to include 1 new or modified test files. |
          | +1 | mvninstall | 14m 31s | trunk passed |
          | +1 | compile | 0m 32s | trunk passed |
          | +1 | checkstyle | 0m 26s | trunk passed |
          | +1 | mvnsite | 0m 35s | trunk passed |
          | +1 | mvneclipse | 0m 14s | trunk passed |
          | +1 | findbugs | 1m 3s | trunk passed |
          | +1 | javadoc | 0m 22s | trunk passed |
          | +1 | mvninstall | 0m 30s | the patch passed |
          | +1 | compile | 0m 30s | the patch passed |
          | +1 | javac | 0m 30s | the patch passed |
          | +1 | checkstyle | 0m 22s | the patch passed |
          | +1 | mvnsite | 0m 31s | the patch passed |
          | +1 | mvneclipse | 0m 12s | the patch passed |
          | +1 | whitespace | 0m 0s | The patch has no whitespace issues. |
          | +1 | findbugs | 1m 5s | the patch passed |
          | +1 | javadoc | 0m 18s | the patch passed |
          | -1 | unit | 39m 8s | hadoop-yarn-server-resourcemanager in the patch failed. |
          | +1 | asflicense | 0m 18s | The patch does not generate ASF License warnings. |
          |  |  | 62m 16s |  |

          | Reason | Tests |
          | Failed junit tests | hadoop.yarn.server.resourcemanager.TestRMRestart |

          | Subsystem | Report/Notes |
          | Docker | Image:yetus/hadoop:a9ad5d6 |
          | JIRA Issue | YARN-6354 |
          | JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12861105/YARN-6354.001.patch |
          | Optional Tests | asflicense compile javac javadoc mvninstall mvnsite unit findbugs checkstyle |
          | uname | Linux 4fb5e828bdd0 3.13.0-103-generic #150-Ubuntu SMP Thu Nov 24 10:34:17 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux |
          | Build tool | maven |
          | Personality | /testptch/hadoop/patchprocess/precommit/personality/provided.sh |
          | git revision | trunk / 4966a6e |
          | Default Java | 1.8.0_121 |
          | findbugs | v3.0.0 |
          | unit | https://builds.apache.org/job/PreCommit-YARN-Build/15424/artifact/patchprocess/patch-unit-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager.txt |
          | Test Results | https://builds.apache.org/job/PreCommit-YARN-Build/15424/testReport/ |
          | modules | C: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager U: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager |
          | Console output | https://builds.apache.org/job/PreCommit-YARN-Build/15424/console |
          | Powered by | Apache Yetus 0.5.0-SNAPSHOT http://yetus.apache.org |

          This message was automatically generated.

          jlowe Jason Lowe added a comment -

          The TestRMRestart failure is unrelated.

          eepayne Eric Payne added a comment -

          +1.

          Thanks Jason Lowe. I will merge to trunk, branch-2, and branch-2.8.

          hudson Hudson added a comment -

          SUCCESS: Integrated in Jenkins build Hadoop-trunk-Commit #11509 (See https://builds.apache.org/job/Hadoop-trunk-Commit/11509/)
          YARN-6354. LeveldbRMStateStore can parse invalid keys when recovering (epayne: rev 318bfb01bc6793da09e32e9cc292eb63224b6ca2)

          • (edit) hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/recovery/TestLeveldbRMStateStore.java
          • (edit) hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/recovery/LeveldbRMStateStore.java
          hudson Hudson added a comment -

          SUCCESS: Integrated in Jenkins build Hadoop-trunk-Commit #11591 (See https://builds.apache.org/job/Hadoop-trunk-Commit/11591/)
          YARN-6354. LeveldbRMStateStore can parse invalid keys when recovering (epayne: rev 318bfb01bc6793da09e32e9cc292eb63224b6ca2)

          • (edit) hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/recovery/TestLeveldbRMStateStore.java
          • (edit) hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/recovery/LeveldbRMStateStore.java
          vinodkv Vinod Kumar Vavilapalli added a comment -

          2.8.1 became a security release. Moving fix-version to 2.8.2 after the fact.


            People

            • Assignee:
              jlowe Jason Lowe
              Reporter:
              jlowe Jason Lowe
            • Votes:
              0
              Watchers:
              10