
YARN-449: HBase test failures when running against Hadoop 2

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Blocker
    • Resolution: Fixed
    • Affects Version/s: 2.0.3-alpha
    • Fix Version/s: 2.0.4-alpha
    • Component/s: None
    • Labels: None

      Description

      Post YARN-429, unit tests for HBase continue to fail since the classpath for the MRAppMaster is not being set correctly.
      Reverting YARN-129 may fix this, but I'm not sure that's the correct solution. My guess is, as Alejandro pointed out in YARN-129, maven classloader magic is messing up java.class.path.

      Attachments

      1. 7904-v5.txt
        2 kB
        Ted Yu
      2. hbase-7904-v3.txt
        1 kB
        Ted Yu
      3. hbase-TestHFileOutputFormat-wip.txt
        1 kB
        Ted Yu
      4. hbase-TestingUtility-wip.txt
        1 kB
        Ted Yu
      5. minimr_randomdir-branch2.txt
        3 kB
        Siddharth Seth


          Activity

          Ted Yu added a comment -

          With this change, I was able to get TestHFileOutputFormat#testWritingPEData to pass.

          Hitesh Shah added a comment -

          This will probably work in the short term, until the internal implementation of MiniYarnCluster (or any other mini cluster, for that matter) introduces a new config property that it needs/refers to.

          Looking at the hbase tests, it seems that instead of using the config object returned by the MiniMRCluster and building on top of it, they try to do some form of union between two confs. In such cases, some internal settings are always likely to be missed. I believe there was an earlier fix to set the framework.name to 'yarn' to solve something similar to the current problem when hbase started running tests against 0.23.

          Ted Yu, do you have any comments on the above? Is it possible to change the base test class for the hbase unit tests to build upon the config provided by the mini cluster? Any reason for not doing so?
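          For illustration, a minimal sketch of the pattern described above (illustrative code, not from the actual HBase patch): start from the conf the mini cluster hands back and layer test settings on top, instead of unioning two confs.

              import org.apache.hadoop.mapred.JobConf;
              import org.apache.hadoop.mapred.MiniMRCluster;

              // Sketch only: createJobConf() already carries the RM address,
              // framework name, and any internal settings the mini cluster set
              // during startup, so nothing is silently missed.
              public class MiniClusterConfSketch {
                public static JobConf jobConfFor(MiniMRCluster mrCluster,
                    String testKey, String testValue) {
                  JobConf jobConf = mrCluster.createJobConf();
                  // Layer test-specific settings on top of the cluster's conf.
                  jobConf.set(testKey, testValue);
                  return jobConf;
                }
              }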

          Ted Yu added a comment -

          It is possible to change the HBase test class.

          I will spend some time tomorrow understanding why the following code in MiniYARNCluster doesn't have the expected effect:

              public synchronized void start() {
                try {
                  getConfig().setBoolean(YarnConfiguration.IS_MINI_YARN_CLUSTER, true);
          
          Ted Yu added a comment -

          Patch where I try to add "yarn.is.minicluster" at cluster startup.

          Ted Yu added a comment -

          The second patch made TestTableMapReduce pass based on 2.0.4-SNAPSHOT:

          Running org.apache.hadoop.hbase.mapreduce.TestTableMapReduce
          2013-03-04 21:09:47.027 java[28537:1203] Unable to load realm info from SCDynamicStore
          2013-03-04 21:09:47.166 java[28537:1203] Unable to load realm info from SCDynamicStore
          Tests run: 2, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 238.86 sec
          
          Ted Yu added a comment -

          I talked with Daniel Dai, who showed me how Pig uses the mini cluster.
          YARN-129 doesn't affect Pig.

          Ted Yu added a comment -

          For Hive, with the following change:

          Index: build.properties
          ===================================================================
          --- build.properties	(revision 1452955)
          +++ build.properties	(working copy)
          @@ -28,13 +28,13 @@
          
           hadoop-0.20.version=0.20.2
           hadoop-0.20S.version=1.0.0
          -hadoop-0.23.version=2.0.0-alpha
          -hadoop.version=${hadoop-0.20.version}
          -hadoop.security.version=${hadoop-0.20S.version}
          +hadoop-0.23.version=2.0.3-alpha
          +hadoop.version=2.0.3-alpha
          +hadoop.security.version=2.0.3-alpha
           # Used to determine which set of Hadoop artifacts we depend on.
           # - 20: hadoop-core, hadoop-test
           # - 23: hadoop-common, hadoop-mapreduce-*, etc
          -hadoop.mr.rev=20
          +hadoop.mr.rev=23
          
           build.dir.hive=${hive.root}/build
           build.dir.hadoop=${build.dir.hive}/hadoopcore
          

          I ran:
          ant test -Dtestcase=TestCliDriver -Dqfile=subq.q -Doverwrite=true -Dtest.silent=false

          It passed.

          Siddharth Seth added a comment -

          Ted, good catch on the isMiniYARNCluster property not being set. I assumed that would be in place.

          The MiniCluster, as part of its startup process, sets parameters like the RM address in the configuration.
          This is then available via createJobConf in MiniMRCluster, or getConfig in MiniMRClientCluster. Instead of selectively copying out parameters, downstream projects should really be using the configuration objects returned by these APIs to submit jobs. That would allow things to keep working if parameters were changed.

          Looked at the PIG code, and that's exactly what it is doing - so the tests passing is expected.
          I'm not sure if Hive unit tests will work. In the test command you pasted, I believe TestCliDriver needs to be replaced with TestMinimrCliDriver to actually get it to use the MiniMRCluster.

          IAC, does it make sense for HBase to make use of the config objects returned by these "getConfig"/"createJobConf" methods so that similar changes in the future don't break unit tests?

          Ted Yu added a comment -

          Here is the result of using TestMinimrCliDriver:

              [junit] Formatting using clusterid: testClusterID
              [junit] Running org.apache.hadoop.hive.cli.TestMinimrCliDriver
              [junit] Tests run: 1, Failures: 0, Errors: 1, Time elapsed: 0 sec
              [junit] Test org.apache.hadoop.hive.cli.TestMinimrCliDriver FAILED
          
          BUILD FAILED
          /Users/tyu/hive/build.xml:289: The following error occurred while executing this line:
          /Users/tyu/hive/build.xml:291: The following error occurred while executing this line:
          /Users/tyu/hive/build-common.xml:462: Tests failed!
          

          The reason was:

            <error message="org.apache.hadoop.hdfs.MiniDFSCluster.getFileSystem()Lorg/apache/hadoop/fs/FileSystem;" type="java.lang.NoSuchMethodError">java.lang.NoSuchMethodError: org.apache.hadoop.hdfs.MiniDFSCluster.getFileSystem()Lorg/apache/hadoop/fs/FileSystem;
            at org.apache.hadoop.hive.shims.HadoopShimsSecure$MiniDFSShim.getFileSystem(HadoopShimsSecure.java:126)
            at org.apache.hadoop.hive.ql.QTestUtil.<init>(QTestUtil.java:296)
            at org.apache.hadoop.hive.cli.TestMinimrCliDriver.<clinit>(TestMinimrCliDriver.java:45)
            at java.lang.Class.forName0(Native Method)
            at java.lang.Class.forName(Class.java:169)
            at org.apache.tools.ant.taskdefs.optional.junit.JUnitTestRunner.run(JUnitTestRunner.java:373)
            at org.apache.tools.ant.taskdefs.optional.junit.JUnitTestRunner.launch(JUnitTestRunner.java:1052)
            at org.apache.tools.ant.taskdefs.optional.junit.JUnitTestRunner.main(JUnitTestRunner.java:906)
          

          I will be in training the next two days, expect some delay in response.

          Ted Yu added a comment -

          I tried the following change:

          Index: hbase-server/src/test/java/org/apache/hadoop/hbase/HBaseTestingUtility.java
          ===================================================================
          --- hbase-server/src/test/java/org/apache/hadoop/hbase/HBaseTestingUtility.java	(revision 1453107)
          +++ hbase-server/src/test/java/org/apache/hadoop/hbase/HBaseTestingUtility.java	(working copy)
          @@ -1578,6 +1578,7 @@
               mrCluster = new MiniMRCluster(servers,
                 FS_URI != null ? FS_URI : FileSystem.get(conf).getUri().toString(), 1,
                 null, null, new JobConf(this.conf));
          +    this.conf = mrCluster.createJobConf();
               JobConf jobConf = MapreduceTestingShim.getJobConf(mrCluster);
               if (jobConf == null) {
                 jobConf = mrCluster.createJobConf();
          

          mapreduce.TestTableMapReduce#testMultiRegionTable hangs running against hadoop 1.0

          Ted Yu added a comment -

          Patch v3 allows mapreduce.TestTableMapReduce#testMultiRegionTable to pass on both hadoop 1.0 and 2.0.4-SNAPSHOT (with YARN-429)

          Arun C Murthy added a comment -

          Ted Yu, thanks a bunch for helping us debug this. Can you please verify that this is just an HBase issue so we can close this out on our end? Thanks again!

          Ted Yu added a comment -

          HBase can adjust as Sid suggested.

          What about Hive?
          I didn't finish verification for Hive.

          Siddharth Seth added a comment -

          I tried the Hive tests a couple of days ago. Same problem as what Ted saw - i.e. the MiniDFSCluster error. That's an error before the MiniMRCluster is created in Hive. From looking at the Hive unit test code, I expect them to require similar changes to use the configuration provided by the MiniMRCluster.

          Does it make sense to include an additional method on the MiniClusters - something like getMinimalConfiguration() - which would only include the parameters required to talk to the cluster? This would allow downstream projects to pull such configuration from MiniDFS, MiniMR, etc., and then build out a config from all of these.
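          A hypothetical sketch of what such a method might look like on a mini cluster class (getMinimalConfiguration() does not exist at this point; the property names below are just examples of client-facing keys):

              import org.apache.hadoop.conf.Configuration;

              // Hypothetical API: copy only the client-facing parameters out of
              // the cluster's full config into an empty Configuration.
              public Configuration getMinimalConfiguration() {
                Configuration minimal = new Configuration(false); // no defaults
                Configuration full = getConfig();
                for (String key : new String[] {
                    "yarn.resourcemanager.address",
                    "mapreduce.framework.name",
                    "yarn.is.minicluster" }) {
                  if (full.get(key) != null) {
                    minimal.set(key, full.get(key));
                  }
                }
                return minimal;
              }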

          Ted Yu added a comment -

          Assuming the JobConf returned by MiniMRCluster contains the conf parameters we passed to MiniMRCluster at start time, I don't know if getMinimalConfiguration() is needed.
          In HBase we use a shim to hide the difference between hadoop 1.0 and 2.0:

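              // Ask the shim (which hides the hadoop 1.0 vs 2.0 difference) for
              // the cluster's JobConf; fall back to createJobConf() if the shim
              // returns null.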
              JobConf jobConf = MapreduceTestingShim.getJobConf(mrCluster);
              if (jobConf == null) {
                jobConf = mrCluster.createJobConf();
              }
          
          Ted Yu added a comment -

          I couldn't get TestRowCounter to pass using 2.0.4-SNAPSHOT maven artifacts.

          I tried the last two patches attached here, same result.

          Ted Yu added a comment -

          I used the same command on Mac and the test passed:

          mvn test -Dhadoop.profile=2.0 -PlocalTests -Dtest=TestRowCounter

          Ted Yu added a comment -

          Here is the Linux OS where tests failed:

          Linux ygridcore.net 2.6.32-220.23.1.el6.YAHOO.20120713.x86_64 #1 SMP Fri Jul 13 11:40:51 CDT 2012 x86_64 x86_64 x86_64 GNU/Linux

          Ted Yu added a comment -

          There was a test failure using patch v3 against hadoop 1.0. See:
          https://builds.apache.org/job/PreCommit-HBASE-Build/4719/artifact/trunk/hbase-server/target/surefire-reports/org.apache.hadoop.hbase.mapreduce.TestHFileOutputFormat.txt

          Ted Yu added a comment -

          TestRowCounter failed with hadoop 2.0.2-alpha on the above machine as well.

          I saw the following in test output:

          2013-03-08 18:39:07,078 WARN  [Thread-58] datanode.BlockPoolSliceScanner(648): Received exception:
          java.io.FileNotFoundException: /homes/hortonzy/trunk/hbase-server/target/test-data/30098d58-1320-499d-9cac-2025bd03f25d/dfscluster_19bfcc6a-b942-4692-ab78-8d55040f9fac/dfs/data/data1/current/BP-1507903350-68.142.244.61-1362767892061/dncp_block_verification.log.curr (No such file or directory)
                  at java.io.FileOutputStream.openAppend(Native Method)
                  at java.io.FileOutputStream.<init>(FileOutputStream.java:192)
                  at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.RollingLogsImpl.roll(RollingLogsImpl.java:111)
                  at org.apache.hadoop.hdfs.server.datanode.BlockPoolSliceScanner.rollVerificationLogs(BlockPoolSliceScanner.java:646)
                  at org.apache.hadoop.hdfs.server.datanode.BlockPoolSliceScanner.scan(BlockPoolSliceScanner.java:636)
                  at org.apache.hadoop.hdfs.server.datanode.BlockPoolSliceScanner.scanBlockPoolSlice(BlockPoolSliceScanner.java:599)
                  at org.apache.hadoop.hdfs.server.datanode.DataBlockScanner.run(DataBlockScanner.java:101)
                  at java.lang.Thread.run(Thread.java:662)
          2013-03-08 18:39:16,015 WARN  [AsyncDispatcher event handler] event.AsyncDispatcher$1(71): AsyncDispatcher thread interrupted
          java.lang.InterruptedException
                  at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.reportInterruptAfterWait(AbstractQueuedSynchronizer.java:1961)
                  at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:1996)
                  at java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:399)
                  at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:69)
                  at java.lang.Thread.run(Thread.java:662)
          
          Ted Yu added a comment -

          Patch v5 was suggested by Sid.

          Ted Yu added a comment -

          I tried 3 methods, each of which resulted in a different mapreduce job failing.

          1. copy JobConf returned from MiniMRCluster.getJobConf() to HBaseTestingUtility.conf
          here is test result based on hadoop 1.0 (org.apache.hadoop.hbase.mapreduce.TestHFileOutputFormat failed):
          https://builds.apache.org/job/PreCommit-HBASE-Build/4719//testReport/

          2. extract config entries from MiniMRCluster.getJobConf() which are absent in HBaseTestingUtility.conf and put them in HBaseTestingUtility.conf (see the sketch after this list)
          mapreduce.TestRowCounter failed building against hadoop 2.0:
          https://builds.apache.org/job/PreCommit-HBASE-Build/4743//testReport/

          3. extract all config entries from MiniMRCluster.getJobConf() and put them in HBaseTestingUtility.conf
          mapreduce.TestImportExport failed building against hadoop 2.0:
          https://builds.apache.org/job/PreCommit-HBASE-Build/4746//testReport/
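          A minimal sketch of approach 2 above (illustrative, not the actual HBASE-7904 code), relying on Hadoop's Configuration being iterable over its entries:

              import java.util.Map;
              import org.apache.hadoop.conf.Configuration;

              // Sketch: copy only the entries from the mini cluster's conf that
              // the test utility's conf does not already define.
              public class ConfMergeSketch {
                public static void copyAbsentEntries(Configuration from, Configuration to) {
                  for (Map.Entry<String, String> entry : from) {
                    if (to.get(entry.getKey()) == null) {
                      to.set(entry.getKey(), entry.getValue());
                    }
                  }
                }
              }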

          Ted Yu added a comment -

          Under hbase-server/target/org.apache.hadoop.mapred.MiniMRCluster, I found several directories.
          I looked at:
          org.apache.hadoop.mapred.MiniMRCluster-logDir-nm-0_0/application_1363018478441_0002/container_1363018478441_0002_01_000001/syslog

          but didn't find much of a clue.

          Siddharth Seth added a comment -

          Ted, are all these tests passing against 2.0.2-alpha? That seems like the first thing that should be addressed - did something change in 2.0.3 which causes them to fail?
          Also, does HBase run tests in parallel? Three different tests failing in different test-runs on the build box could be caused by this. Looking at the way MiniMRCluster works, the work-dir ends up being the same across tests - so that can be problematic. (Attaching a wip patch for this which I've tested on a couple of HBase tests - running individually.) Would it be possible for you to try the patch on a box where you're seeing failures?
          From running the three tests locally - TestImportExport consistently hangs. The other two seem to work ok.
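          A minimal sketch of the idea behind the wip patch (the real change is the attached minimr_randomdir-branch2.txt, later MAPREDUCE-5083; names here are illustrative):

              import java.util.Random;

              // Sketch: give each MiniMRCluster instance a unique directory
              // suffix so parallel test runs on one box don't collide on
              // target/org.apache.hadoop.mapred.MiniMRCluster.
              public class RandomDirSketch {
                public static String uniqueClusterDir(String baseDir) {
                  int suffix = new Random().nextInt(Integer.MAX_VALUE);
                  return baseDir + "/org.apache.hadoop.mapred.MiniMRCluster_" + suffix;
                }
              }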

          Siddharth Seth added a comment -

          Attaching minimr_randomdir-branch2.txt, which randomizes directories generated when using MiniMRCluster.

          Ted Yu added a comment -

          The test failure couldn't be reproduced on Mac.
          On the flubber machine, protoc 2.5 gave me an error.
          I will try to build with your patch on flubber.

          Thanks

          Ted Yu added a comment -

          On flubber, I installed protoc 2.4.1 but couldn't use it:

          $ protoc --version
          protoc: error while loading shared libraries: libprotobuf.so.7: cannot open shared object file: No such file or directory
          

          I applied minimr_randomdir-branch2.txt locally and ran the following command:

          mt -Dhadoop.profile=2.0 -Dtest=org.apache.hadoop.hbase.mapreduce.TestTableMapReduce#testMultiRegionTable

          The test passed.

          Ted Yu added a comment -

          From https://builds.apache.org/job/HBase-TRUNK-on-Hadoop-2.0.0/442/testReport/org.apache.hadoop.hbase.mapreduce/TestRowCounter/testRowCounterNoColumn/ (hadoop-2.0.2-alpha was used):

          2013-03-12 05:44:18,139 WARN  [DeletionService #1] nodemanager.DefaultContainerExecutor(276): delete returned false for path: [/home/jenkins/jenkins-slave/workspace/HBase-TRUNK-on-Hadoop-2.0.0/trunk/hbase-server/target/org.apache.hadoop.mapred.MiniMRCluster/org.apache.hadoop.mapred.MiniMRCluster-localDir-nm-1_0/usercache/jenkins/appcache/application_1363067018215_0001/container_1363067018215_0001_01_000001]
          2013-03-12 05:44:18,145 WARN  [AsyncDispatcher event handler] nodemanager.NMAuditLogger(150): USER=jenkins	OPERATION=Container Finished - Failed	TARGET=ContainerImpl	RESULT=FAILURE	DESCRIPTION=Container failed with state: LOCALIZATION_FAILED	APPID=application_1363067018215_0001	CONTAINERID=container_1363067018215_0001_01_000001
          2013-03-12 05:44:18,141 WARN  [DeletionService #0] nodemanager.DefaultContainerExecutor(276): delete returned false for path: [/home/jenkins/jenkins-slave/workspace/HBase-TRUNK-on-Hadoop-2.0.0/trunk/hbase-server/target/org.apache.hadoop.mapred.MiniMRCluster/org.apache.hadoop.mapred.MiniMRCluster-localDir-nm-1_1/usercache/jenkins/appcache/application_1363067018215_0001/container_1363067018215_0001_01_000001]
          2013-03-12 05:44:18,220 WARN  [DeletionService #0] nodemanager.DefaultContainerExecutor(276): delete returned false for path: [/home/jenkins/jenkins-slave/workspace/HBase-TRUNK-on-Hadoop-2.0.0/trunk/hbase-server/target/org.apache.hadoop.mapred.MiniMRCluster/org.apache.hadoop.mapred.MiniMRCluster-localDir-nm-1_2/usercache/jenkins/appcache/application_1363067018215_0001/container_1363067018215_0001_01_000001]
          2013-03-12 05:44:18,220 WARN  [DeletionService #0] nodemanager.DefaultContainerExecutor(276): delete returned false for path: [/home/jenkins/jenkins-slave/workspace/HBase-TRUNK-on-Hadoop-2.0.0/trunk/hbase-server/target/org.apache.hadoop.mapred.MiniMRCluster/org.apache.hadoop.mapred.MiniMRCluster-localDir-nm-1_3/usercache/jenkins/appcache/application_1363067018215_0001/container_1363067018215_0001_01_000001]
          2013-03-12 05:44:18,865 WARN  [AsyncDispatcher event handler] resourcemanager.RMAuditLogger(255): USER=jenkins	OPERATION=Application Finished - Failed	TARGET=RMAppManager	RESULT=FAILURE	DESCRIPTION=App failed with state: FAILED	PERMISSIONS=Application application_1363067018215_0001 failed 1 times due to AM Container for appattempt_1363067018215_0001_000001 exited with  exitCode: -1000 due to: RemoteTrace: 
          java.io.IOException: Unable to rename file: [/home/jenkins/jenkins-slave/workspace/HBase-TRUNK-on-Hadoop-2.0.0/trunk/hbase-server/target/org.apache.hadoop.mapred.MiniMRCluster/org.apache.hadoop.mapred.MiniMRCluster-localDir-nm-1_1/usercache/jenkins/filecache/5596410335910248146_tmp/hadoop-262140332608909552.jar.tmp] to [/home/jenkins/jenkins-slave/workspace/HBase-TRUNK-on-Hadoop-2.0.0/trunk/hbase-server/target/org.apache.hadoop.mapred.MiniMRCluster/org.apache.hadoop.mapred.MiniMRCluster-localDir-nm-1_1/usercache/jenkins/filecache/5596410335910248146_tmp/hadoop-262140332608909552.jar]
          	at org.apache.hadoop.yarn.util.FSDownload.unpack(FSDownload.java:162)
          	at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:205)
          	at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:50)
          	at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
          	at java.util.concurrent.FutureTask.run(FutureTask.java:138)
          	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:441)
          	at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
          	at java.util.concurrent.FutureTask.run(FutureTask.java:138)
          	at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
          	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
          	at java.lang.Thread.run(Thread.java:662)
           at LocalTrace: 
          	org.apache.hadoop.yarn.exceptions.impl.pb.YarnRemoteExceptionPBImpl: Unable to rename file: [/home/jenkins/jenkins-slave/workspace/HBase-TRUNK-on-Hadoop-2.0.0/trunk/hbase-server/target/org.apache.hadoop.mapred.MiniMRCluster/org.apache.hadoop.mapred.MiniMRCluster-localDir-nm-1_1/usercache/jenkins/filecache/5596410335910248146_tmp/hadoop-262140332608909552.jar.tmp] to [/home/jenkins/jenkins-slave/workspace/HBase-TRUNK-on-Hadoop-2.0.0/trunk/hbase-server/target/org.apache.hadoop.mapred.MiniMRCluster/org.apache.hadoop.mapred.MiniMRCluster-localDir-nm-1_1/usercache/jenkins/filecache/5596410335910248146_tmp/hadoop-262140332608909552.jar]
          	at org.apache.hadoop.yarn.server.nodemanager.api.protocolrecords.impl.pb.LocalResourceStatusPBImpl.convertFromProtoFormat(LocalResourceStatusPBImpl.java:217)
          	at org.apache.hadoop.yarn.server.nodemanager.api.protocolrecords.impl.pb.LocalResourceStatusPBImpl.getException(LocalResourceStatusPBImpl.java:147)
          	at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService$LocalizerRunner.update(ResourceLocalizationService.java:824)
          	at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService$LocalizerTracker.processHeartbeat(ResourceLocalizationService.java:493)
          	at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService.heartbeat(ResourceLocalizationService.java:222)
          	at org.apache.hadoop.yarn.server.nodemanager.api.impl.pb.service.LocalizationProtocolPBServiceImpl.heartbeat(LocalizationProtocolPBServiceImpl.java:46)
          	at org.apache.hadoop.yarn.proto.LocalizationProtocol$LocalizationProtocolService$2.callBlockingMethod(LocalizationProtocol.java:57)
          	at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:454)
          	at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:910)
          	at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1694)
          	at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1690)
          	at java.security.AccessController.doPrivileged(Native Method)
          	at javax.security.auth.Subject.doAs(Subject.java:396)
          	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1367)
          	at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1688)
          
          .Failing this attempt.. Failing the application.	APPID=application_1363067018215_0001
          2013-03-12 05:44:19,863 WARN  [DeletionService #2] nodemanager.DefaultContainerExecutor(276): delete returned false for path: [/home/jenkins/jenkins-slave/workspace/HBase-TRUNK-on-Hadoop-2.0.0/trunk/hbase-server/target/org.apache.hadoop.mapred.MiniMRCluster/org.apache.hadoop.mapred.MiniMRCluster-localDir-nm-1_0/usercache/jenkins/appcache/application_1363067018215_0001]
          2013-03-12 05:44:19,864 WARN  [DeletionService #2] nodemanager.DefaultContainerExecutor(276): delete returned false for path: [/home/jenkins/jenkins-slave/workspace/HBase-TRUNK-on-Hadoop-2.0.0/trunk/hbase-server/target/org.apache.hadoop.mapred.MiniMRCluster/org.apache.hadoop.mapred.MiniMRCluster-localDir-nm-1_1/usercache/jenkins/appcache/application_1363067018215_0001]
          2013-03-12 05:44:19,864 WARN  [DeletionService #2] nodemanager.DefaultContainerExecutor(276): delete returned false for path: [/home/jenkins/jenkins-slave/workspace/HBase-TRUNK-on-Hadoop-2.0.0/trunk/hbase-server/target/org.apache.hadoop.mapred.MiniMRCluster/org.apache.hadoop.mapred.MiniMRCluster-localDir-nm-1_2/usercache/jenkins/appcache/application_1363067018215_0001]
          2013-03-12 05:44:19,864 WARN  [DeletionService #2] nodemanager.DefaultContainerExecutor(276): delete returned false for path: [/home/jenkins/jenkins-slave/workspace/HBase-TRUNK-on-Hadoop-2.0.0/trunk/hbase-server/target/org.apache.hadoop.mapred.MiniMRCluster/org.apache.hadoop.mapred.MiniMRCluster-localDir-nm-1_3/usercache/jenkins/appcache/application_1363067018215_0001]
          
          Siddharth Seth added a comment -
          2013-03-12 18:53:39,275 WARN  [Container Monitor] monitor.ContainersMonitorImpl$MonitoringThread(444): Container [pid=8438,containerID=container_1363114400920_0001_01_000001] is running beyond virtual memory limits. Current usage: 217.9 MB of 2 GB physical memory used; 6.5 GB of 4.2 GB virtual memory used. Killing container.
          Dump of the process-tree for container_1363114400920_0001_01_000001 :
                  |- PID PPID PGRPID SESSID CMD_NAME USER_MODE_TIME(MILLIS) SYSTEM_TIME(MILLIS) VMEM_USAGE(BYTES) RSSMEM_USAGE(PAGES) FULL_CMD_LINE
                  |- 8438 7023 8438 8438 (bash) 1 0 108650496 310 /bin/bash -c /usr/lib/jvm/java-1.6.0-sun-1.6.0.37.x86_64/bin/java -Dlog4j.configuration=container-log4j.properties -Dyarn.app.mapreduce.container.log.dir=/trunk/hbase-server/target/org.apache.hadoop.mapred.MiniMRCluster_1035429065/org.apache.hadoop.mapred.MiniMRCluster_1035429065-logDir-nm-1_2/application_1363114400920_0001/container_1363114400920_0001_01_000001 -Dyarn.app.mapreduce.container.log.filesize=0 -Dhadoop.root.logger=INFO,CLA  -Xmx1024m org.apache.hadoop.mapreduce.v2.app.MRAppMaster 1>/trunk/hbase-server/target/org.apache.hadoop.mapred.MiniMRCluster_1035429065/org.apache.hadoop.mapred.MiniMRCluster_1035429065-logDir-nm-1_2/application_1363114400920_0001/container_1363114400920_0001_01_000001/stdout 2>/trunk/hbase-server/target/org.apache.hadoop.mapred.MiniMRCluster_1035429065/org.apache.hadoop.mapred.MiniMRCluster_1035429065-logDir-nm-1_2/application_1363114400920_0001/container_1363114400920_0001_01_000001/stderr
          

          This is what caused TestRowCounter to fail in the linux env. Not sure why the vmem is going that high. The hadoop-1 default config likely disables this monitoring.

          At this point there seem to be solutions for the original problem this jira was opened for, and it has really been re-purposed to get HBase unit tests working with Hadoop 2. Changing the title accordingly.
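          For reference, a sketch of how the memory checks can be disabled on a conf; the property names are the standard NodeManager keys in Hadoop 2.x, and doing this by default in the mini cluster is what YARN-470/MAPREDUCE-5094 (mentioned in the summary below) address:

              import org.apache.hadoop.conf.Configuration;

              // Sketch: turn off the NodeManager's virtual/physical memory
              // checks so test containers aren't killed for high vmem usage.
              public class DisableMemCheckSketch {
                public static void disableMemoryChecks(Configuration conf) {
                  conf.setBoolean("yarn.nodemanager.vmem-check-enabled", false);
                  conf.setBoolean("yarn.nodemanager.pmem-check-enabled", false);
                }
              }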

          Ted Yu added a comment -

          More from TEST-org.apache.hadoop.hbase.mapreduce.TestRowCounter.xml:

          2013-03-12 18:53:39,274 WARN  [Container Monitor] monitor.ContainersMonitorImpl(298): Process tree for container: container_1363114400920_0001_01_000001 has processes older than 1 iteration running over the configured limit. Limit=4509715456, current usage = 7007866880
          2013-03-12 18:53:39,275 WARN  [Container Monitor] monitor.ContainersMonitorImpl$MonitoringThread(444): Container [pid=8438,containerID=container_1363114400920_0001_01_000001] is running beyond virtual memory limits. Current usage: 217.9 MB of 2 GB physical memory used; 6.5 GB of 4.2 GB virtual memory used. Killing container.
          Dump of the process-tree for container_1363114400920_0001_01_000001 :
                  |- PID PPID PGRPID SESSID CMD_NAME USER_MODE_TIME(MILLIS) SYSTEM_TIME(MILLIS) VMEM_USAGE(BYTES) RSSMEM_USAGE(PAGES) FULL_CMD_LINE
                  |- 8438 7023 8438 8438 (bash) 1 0 108650496 310 /bin/bash -c /usr/lib/jvm/java-1.6.0-sun-1.6.0.37.x86_64/bin/java -Dlog4j.configuration=container-log4j.properties -Dyarn.app.mapreduce.container.log.dir=/homes/hortonzy/trunk/hbase-server/target/org.apache.hadoop.mapred.MiniMRCluster_1035429065/org.apache.hadoop.mapred.MiniMRCluster_1035429065-logDir-nm-1_2/application_1363114400920_0001/container_1363114400920_0001_01_000001 -Dyarn.app.mapreduce.container.log.filesize=0 -Dhadoop.root.logger=INFO,CLA  -Xmx1024m org.apache.hadoop.mapreduce.v2.app.MRAppMaster 1>/homes/hortonzy/trunk/hbase-server/target/org.apache.hadoop.mapred.MiniMRCluster_1035429065/org.apache.hadoop.mapred.MiniMRCluster_1035429065-logDir-nm-1_2/application_1363114400920_0001/container_1363114400920_0001_01_000001/stdout 2>/homes/hortonzy/trunk/hbase-server/target/org.apache.hadoop.mapred.MiniMRCluster_1035429065/org.apache.hadoop.mapred.MiniMRCluster_1035429065-logDir-nm-1_2/application_1363114400920_0001/container_1363114400920_0001_01_000001/stderr
                  |- 8461 8438 8438 8438 (java) 688 34 6899216384 55478 /usr/lib/jvm/java-1.6.0-sun-1.6.0.37.x86_64/bin/java -Dlog4j.configuration=container-log4j.properties -Dyarn.app.mapreduce.container.log.dir=/homes/hortonzy/trunk/hbase-server/target/org.apache.hadoop.mapred.MiniMRCluster_1035429065/org.apache.hadoop.mapred.MiniMRCluster_1035429065-logDir-nm-1_2/application_1363114400920_0001/container_1363114400920_0001_01_000001 -Dyarn.app.mapreduce.container.log.filesize=0 -Dhadoop.root.logger=INFO,CLA -Xmx1024m org.apache.hadoop.mapreduce.v2.app.MRAppMaster
          

          Note that 1024m was specified for -Xmx, yet the java process shows a VMEM_USAGE of 6899216384 bytes (about 6.4 GB).

          Ted Yu added a comment -

          Here is sample content for /proc/<PID>/stat:

          30873 (sshd) S 30869 30869 30869 0 -1 4202816 360 0 0 0 47 56 0 0 20 0 1 0 741791881 117960704 516 18446744073709551615 1 1 0 0 0 0 0 4096 65536 18446744073709551615 0 0 17 2 0 0 0 0 0
          

          Here is the regex used to parse the stat file:

            private static final Pattern PROCFS_STAT_FILE_FORMAT = Pattern.compile(
              "^([0-9-]+)\\s([^\\s]+)\\s[^\\s]\\s([0-9-]+)\\s([0-9-]+)\\s([0-9-]+)\\s" +
              "([0-9-]+\\s){7}([0-9]+)\\s([0-9]+)\\s([0-9-]+\\s){7}([0-9]+)\\s([0-9]+)" +
              "(\\s[0-9-]+){15}");
          
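          A small, illustrative harness (not Hadoop's actual ProcfsBasedProcessTree code) that applies the pattern above to the sample line; with this group layout, group 10 captures the vsize field and group 11 the RSS page count:

            import java.util.regex.Matcher;
            import java.util.regex.Pattern;

            // Illustrative: run the pattern against the sample stat line and
            // pull out the pid, vsize, and rss fields.
            public class ProcStatParseSketch {
              private static final Pattern PROCFS_STAT_FILE_FORMAT = Pattern.compile(
                  "^([0-9-]+)\\s([^\\s]+)\\s[^\\s]\\s([0-9-]+)\\s([0-9-]+)\\s([0-9-]+)\\s" +
                  "([0-9-]+\\s){7}([0-9]+)\\s([0-9]+)\\s([0-9-]+\\s){7}([0-9]+)\\s([0-9]+)" +
                  "(\\s[0-9-]+){15}");

              public static void main(String[] args) {
                String stat = "30873 (sshd) S 30869 30869 30869 0 -1 4202816 360 0 0 0 47 56 0 0"
                    + " 20 0 1 0 741791881 117960704 516 18446744073709551615 1 1 0 0 0 0 0"
                    + " 4096 65536 18446744073709551615 0 0 17 2 0 0 0 0 0";
                Matcher m = PROCFS_STAT_FILE_FORMAT.matcher(stat);
                if (m.find()) {
                  System.out.println("pid   = " + m.group(1));  // 30873
                  System.out.println("vsize = " + m.group(10)); // 117960704 bytes
                  System.out.println("rss   = " + m.group(11)); // 516 pages
                }
              }
            }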
          Ted Yu added a comment -

          Here is the OS for the Hadoop QA machine:
          Linux asf002.sp2.ygridcore.net 2.6.32-33-server #71-Ubuntu SMP Wed Jul 20 17:42:25 UTC 2011 x86_64 GNU/Linux

          Here is the OS for the machine where I ran the unit test manually:
          Linux ygridcore.net 2.6.32-220.23.1.el6.YAHOO.20120713.x86_64 #1 SMP Fri Jul 13 11:40:51 CDT 2012 x86_64 x86_64 x86_64 GNU/Linux

          Ted Yu added a comment -

          Based on http://54.241.6.143/job/HBase-0.95-Hadoop-2/org.apache.hbase$hbase-server/49/testReport/ and http://54.241.6.143/job/HBase-TRUNK-Hadoop-2/50/console , there was no mapreduce job failure after HBASE-7904 was checked in. This JIRA can be resolved.

          Siddharth Seth added a comment -

          Summarizing for reference, and closing this since HBASE-7904 is committed.

          Critical fixes to YARN:
          • YARN-429 fixes capacity-scheduler.conf being unavailable.

          Additional fixes which (in most cases) don't affect HBase, but could affect other projects using the MiniMRCluster:
          • YARN-470 and MAPREDUCE-5094 (wip) - disable memory monitoring by default in the MiniMRYARNCluster
          • MAPREDUCE-5083 - Randomize MiniMRCluster directory component, to allow parallel instances

          Changes to HBase (HBASE-7904):
          • Fixes to TestImportExport to copy yarn configuration in all cases.
          • Changes to HBaseTestingUtility to merge the configuration returned by MiniMRCluster.
          • Change to HRegion to not use a 'CompoundConfiguration' when talking to HDFS, due to the inability to change the RPC engine with CompoundConfiguration. (It caused several *SecureLoad tests to hang.)

          stack added a comment -

          I would not build any argument or deduction on the mess that went on over in hbase-7904.

          Vinod Kumar Vavilapalli added a comment -

          "I would not build any argument or deduction on the mess that went on over in hbase-7904."

          That's good news. Really. Any constructive feedback to avoid the 'mess' in the future is deeply appreciated though.

          Sid and Ted, thanks for being on top of this!


            People

            • Assignee: Unassigned
            • Reporter: Siddharth Seth
            • Votes: 0
            • Watchers: 11
