  1. Hadoop YARN
  2. YARN-6354

LeveldbRMStateStore can parse invalid keys when recovering reservations

Details

    Description

      Upgrading an RM to 2.8 can fail with a StringIndexOutOfBoundsException while loading reservation state.

      Attachments

        1. YARN-6354.001.patch
          4 kB
          Jason Darrell Lowe

          Activity

            Sample stacktrace:

            2017-03-16 15:17:26,616 INFO  [main] service.AbstractService (AbstractService.java:noteFailure(272)) - Service ResourceManager failed in state STARTED; cause: java.lang.StringIndexOutOfBoundsException: String index out of range: -17
            java.lang.StringIndexOutOfBoundsException: String index out of range: -17
            	at java.lang.String.substring(String.java:1931)
            	at org.apache.hadoop.yarn.server.resourcemanager.recovery.LeveldbRMStateStore.loadReservationState(LeveldbRMStateStore.java:289)
            	at org.apache.hadoop.yarn.server.resourcemanager.recovery.LeveldbRMStateStore.loadState(LeveldbRMStateStore.java:274)
            	at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMActiveServices.serviceStart(ResourceManager.java:690)
            	at org.apache.hadoop.service.AbstractService.start(AbstractService.java:193)
            	at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.startActiveServices(ResourceManager.java:1097)
            	at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$1.run(ResourceManager.java:1137)
            	at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$1.run(ResourceManager.java:1133)
            	at java.security.AccessController.doPrivileged(Native Method)
            	at javax.security.auth.Subject.doAs(Subject.java:422)
            	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1936)
            	at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.transitionToActive(ResourceManager.java:1133)
            	at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.serviceStart(ResourceManager.java:1173)
            	at org.apache.hadoop.service.AbstractService.start(AbstractService.java:193)
            	at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.main(ResourceManager.java:1338)
            

            This was broken by YARN-3736. The recovery code seeks to RM_RESERVATION_KEY_PREFIX but fails to verify that the keys it sees in the loop actually have that prefix. Here's the relevant code:

                  iter = new LeveldbIterator(db);
                  iter.seek(bytes(RM_RESERVATION_KEY_PREFIX));
                  while (iter.hasNext()) {
                    Entry<byte[],byte[]> entry = iter.next();
                    String key = asString(entry.getKey());
            
                    String planReservationString =
                        key.substring(RM_RESERVATION_KEY_PREFIX.length());
                    String[] parts = planReservationString.split(SEPARATOR);
                    if (parts.length != 2) {
                      LOG.warn("Incorrect reservation state key " + key);
                      continue;
                    }
            

            The loop terminates only when the iterator runs out of keys, so it scans every key in the database from the first reservation key to the end. If any key encountered is shorter than the prefix, the substring call throws the out-of-bounds exception.

            Pinging adhoot and asuresh who were involved in YARN-3736.

            jlowe Jason Darrell Lowe added the comment above.
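
            To illustrate the failure mode described in the comment above, here is a minimal sketch of why the unguarded substring call throws. The prefix value, the sample keys, and the class and method names are all assumptions made for illustration; the real constant lives in LeveldbRMStateStore and may differ. Only the relationship matters: a key shorter than the prefix makes substring throw.

```java
public class ShortKeySubstringDemo {
  // Assumed stand-in for RM_RESERVATION_KEY_PREFIX (hypothetical value).
  static final String PREFIX = "ReservationSystemRoot/";

  /** Mirrors the unguarded parsing step quoted from the recovery loop. */
  static String stripPrefixUnchecked(String key) {
    return key.substring(PREFIX.length());
  }

  /** Returns true if the unguarded parse throws for the given key. */
  static boolean throwsOnKey(String key) {
    try {
      stripPrefixUnchecked(key);
      return false;
    } catch (StringIndexOutOfBoundsException e) {
      return true;
    }
  }

  public static void main(String[] args) {
    // A well-formed reservation key (hypothetical) parses without error.
    System.out.println(throwsOnKey(PREFIX + "plan1/res1")); // false
    // A key shorter than the prefix triggers the exception.
    System.out.println(throwsOnKey("abcde")); // true
  }
}
```

            Note that a key that is long enough but lacks the prefix does not throw in this sketch; it is simply misparsed, which matches the "can parse invalid keys" wording of the headline.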

            jlowe Jason Darrell Lowe added a comment -

            I found another instance where a rolling upgrade to 2.8 with leveldb succeeded, so I dug deeper into why this doesn't always fail. It turns out that the reservation state keys are normally the last keys in the database, so the loop runs off the end without seeing anything else. If the database has any relatively short keys after the reservation keys, recovery breaks. My local dev database had some short, lowercase keys left over from prior work, and that's how I ran into the issue.

            Since this happens not to be a problem for now with "normal" RM leveldb databases, I lowered the severity and updated the headline accordingly.
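
            The ordering effect described above can be sketched with a sorted map standing in for leveldb. This assumes the database iterates keys in bytewise lexicographic order (leveldb's default comparator), which for ASCII keys matches Java's natural String ordering. The prefix value and all key names below are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

public class SeekOrderDemo {
  // Assumed stand-in for RM_RESERVATION_KEY_PREFIX (hypothetical value).
  static final String PREFIX = "ReservationSystemRoot/";

  /** Builds a small sorted "database" with hypothetical keys. */
  static TreeMap<String, byte[]> sampleDb() {
    TreeMap<String, byte[]> db = new TreeMap<>();
    db.put("RMAppRoot/app1", new byte[0]);       // sorts before the prefix
    db.put(PREFIX + "plan1/res1", new byte[0]);  // a reservation key
    db.put("tmp", new byte[0]);                  // short lowercase leftover
    return db;
  }

  /** Keys visited by "seek to the prefix, then iterate to the end". */
  static List<String> keysVisitedFromPrefix(TreeMap<String, byte[]> db) {
    return new ArrayList<>(db.tailMap(PREFIX, true).keySet());
  }

  public static void main(String[] args) {
    // Prints [ReservationSystemRoot/plan1/res1, tmp]
    System.out.println(keysVisitedFromPrefix(sampleDb()));
  }
}
```

            In this sketch the uppercase key sorts before the prefix and is never visited, while the lowercase key sorts after the reservation keys and is reached by the unbounded scan.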

            jlowe Jason Darrell Lowe added a comment -

            Attaching a patch that adds a termination check to the reservation key traversal loop, plus a unit test.
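
            As a rough sketch of what such a termination check could look like (this is not the attached patch; see YARN-6354.001.patch for the real change), the loop can stop at the first key that lacks the prefix. A sorted map again stands in for the leveldb iterator, and the prefix, separator, key names, and class name are assumptions.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

public class GuardedReservationScan {
  static final String SEPARATOR = "/";                              // assumed
  static final String PREFIX = "ReservationSystemRoot" + SEPARATOR; // assumed

  /** Builds a small sorted "database" with hypothetical keys. */
  static TreeMap<String, byte[]> sampleDb() {
    TreeMap<String, byte[]> db = new TreeMap<>();
    db.put("RMAppRoot/app1", new byte[0]);
    db.put(PREFIX + "plan1" + SEPARATOR + "res1", new byte[0]);
    db.put("tmp", new byte[0]); // short key that sorts after the prefix
    return db;
  }

  /** Returns {plan, reservation} pairs for keys under the prefix only. */
  static List<String[]> loadReservationKeys(TreeMap<String, byte[]> db) {
    List<String[]> result = new ArrayList<>();
    for (String key : db.tailMap(PREFIX, true).keySet()) {
      if (!key.startsWith(PREFIX)) {
        // Keys are sorted, so no later key can carry the prefix either.
        break;
      }
      String[] parts = key.substring(PREFIX.length()).split(SEPARATOR);
      if (parts.length != 2) {
        continue; // malformed reservation key, skip it
      }
      result.add(parts);
    }
    return result;
  }

  public static void main(String[] args) {
    for (String[] parts : loadReservationKeys(sampleDb())) {
      System.out.println(parts[0] + " " + parts[1]);
    }
  }
}
```

            Breaking out of the loop, rather than skipping the key, relies on the sorted iteration order: all keys sharing a prefix form one contiguous range.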
            hadoopqa Hadoop QA added a comment -
            -1 overall



            Vote  Subsystem   Runtime  Comment
            0     reexec      0m 19s   Docker mode activated.
            +1    @author     0m 0s    The patch does not contain any @author tags.
            +1    test4tests  0m 0s    The patch appears to include 1 new or modified test files.
            +1    mvninstall  14m 31s  trunk passed
            +1    compile     0m 32s   trunk passed
            +1    checkstyle  0m 26s   trunk passed
            +1    mvnsite     0m 35s   trunk passed
            +1    mvneclipse  0m 14s   trunk passed
            +1    findbugs    1m 3s    trunk passed
            +1    javadoc     0m 22s   trunk passed
            +1    mvninstall  0m 30s   the patch passed
            +1    compile     0m 30s   the patch passed
            +1    javac       0m 30s   the patch passed
            +1    checkstyle  0m 22s   the patch passed
            +1    mvnsite     0m 31s   the patch passed
            +1    mvneclipse  0m 12s   the patch passed
            +1    whitespace  0m 0s    The patch has no whitespace issues.
            +1    findbugs    1m 5s    the patch passed
            +1    javadoc     0m 18s   the patch passed
            -1    unit        39m 8s   hadoop-yarn-server-resourcemanager in the patch failed.
            +1    asflicense  0m 18s   The patch does not generate ASF License warnings.
                              62m 16s



            Reason              Tests
            Failed junit tests  hadoop.yarn.server.resourcemanager.TestRMRestart



            Subsystem       Report/Notes
            Docker          Image:yetus/hadoop:a9ad5d6
            JIRA Issue      YARN-6354
            JIRA Patch URL  https://issues.apache.org/jira/secure/attachment/12861105/YARN-6354.001.patch
            Optional Tests  asflicense compile javac javadoc mvninstall mvnsite unit findbugs checkstyle
            uname           Linux 4fb5e828bdd0 3.13.0-103-generic #150-Ubuntu SMP Thu Nov 24 10:34:17 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux
            Build tool      maven
            Personality     /testptch/hadoop/patchprocess/precommit/personality/provided.sh
            git revision    trunk / 4966a6e
            Default Java    1.8.0_121
            findbugs        v3.0.0
            unit            https://builds.apache.org/job/PreCommit-YARN-Build/15424/artifact/patchprocess/patch-unit-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager.txt
            Test Results    https://builds.apache.org/job/PreCommit-YARN-Build/15424/testReport/
            modules         C: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager U: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager
            Console output  https://builds.apache.org/job/PreCommit-YARN-Build/15424/console
            Powered by      Apache Yetus 0.5.0-SNAPSHOT http://yetus.apache.org

            This message was automatically generated.


            jlowe Jason Darrell Lowe added a comment -

            The TestRMRestart failure is unrelated.
            epayne Eric Payne added a comment -

            +1.

            Thanks jlowe. I will merge to trunk, branch-2, and branch-2.8.

            hudson Hudson added a comment -

            SUCCESS: Integrated in Jenkins build Hadoop-trunk-Commit #11509 (See https://builds.apache.org/job/Hadoop-trunk-Commit/11509/)
            YARN-6354. LeveldbRMStateStore can parse invalid keys when recovering (epayne: rev 318bfb01bc6793da09e32e9cc292eb63224b6ca2)

            • (edit) hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/recovery/TestLeveldbRMStateStore.java
            • (edit) hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/recovery/LeveldbRMStateStore.java
            hudson Hudson added a comment -

            SUCCESS: Integrated in Jenkins build Hadoop-trunk-Commit #11591 (See https://builds.apache.org/job/Hadoop-trunk-Commit/11591/)
            YARN-6354. LeveldbRMStateStore can parse invalid keys when recovering (epayne: rev 318bfb01bc6793da09e32e9cc292eb63224b6ca2)

            • (edit) hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/recovery/TestLeveldbRMStateStore.java
            • (edit) hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/recovery/LeveldbRMStateStore.java

            vinodkv Vinod Kumar Vavilapalli added a comment -

            2.8.1 became a security release. Moving fix-version to 2.8.2 after the fact.

            People

              jlowe Jason Darrell Lowe
              Votes: 0
              Watchers: 9