Hadoop Common · HADOOP-9079

LocalDirAllocator throws ArithmeticException

Details

    • Type: Bug
    • Status: Resolved
    • Priority: Minor
    • Resolution: Duplicate
    • Affects Version/s: 2.0.3-alpha
    • Fix Version/s: None
    • Component/s: None
    • Labels: None

    Description

      2012-11-19 22:07:41,709 WARN [IPC Server handler 0 on 38671] nodemanager.NMAuditLogger(150): USER=UnknownUser IP=**** OPERATION=Stop Container Request TARGET=ContainerManagerImpl RESULT=FAILURE DESCRIPTION=Trying to stop unknown container! APPID=application_1353391620476_0001 CONTAINERID=container_1353391620476_0001_01_000010
      java.lang.ArithmeticException: / by zero
      at org.apache.hadoop.fs.LocalDirAllocator$AllocatorPerContext.getLocalPathForWrite(LocalDirAllocator.java:368)
      at org.apache.hadoop.fs.LocalDirAllocator.getLocalPathForWrite(LocalDirAllocator.java:150)
      at org.apache.hadoop.fs.LocalDirAllocator.getLocalPathForWrite(LocalDirAllocator.java:131)
      at org.apache.hadoop.fs.LocalDirAllocator.getLocalPathForWrite(LocalDirAllocator.java:115)
      at org.apache.hadoop.yarn.server.nodemanager.LocalDirsHandlerService.getLocalPathForWrite(LocalDirsHandlerService.java:263)
      at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService$LocalizerRunner.run(ResourceLocalizationService.java:849)

      Attachments

        1. trunk-9079.patch
          0.9 kB
          Jimmy Xiang
        2. hadoop-9079-v2.txt
          1 kB
          Ted Yu

        Issue Links

          Activity

            eli Eli Collins added a comment -

            I assume this happens when the local dirs are out of space? I don't think just checking totalAvailable > 0 fixes this; the following loop doesn't bounds-check dir. It needs to be re-written, with a test.

                      long randomPosition = Math.abs(r.nextLong()) % totalAvailable;
                      int dir = 0;
                      while (randomPosition > availableOnDisk[dir]) {
                        randomPosition -= availableOnDisk[dir];
                        dir++;
                      }
            
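            [Editor's note] For context, a minimal standalone sketch (illustrative only, not from the patch; class and method names are hypothetical) of why totalAvailable == 0 makes the quoted line throw:

            ```java
            import java.util.Random;

            public class DivByZeroSketch {
                // Mimics the selection shape in AllocatorPerContext.getLocalPathForWrite:
                // pick a disk with probability proportional to its available space.
                static int pickDir(long[] availableOnDisk, Random r) {
                    long totalAvailable = 0;
                    for (long a : availableOnDisk) {
                        totalAvailable += a;
                    }
                    // When every dir is full (or unwritable), totalAvailable is 0 and
                    // the modulo below throws java.lang.ArithmeticException: / by zero.
                    long randomPosition = Math.abs(r.nextLong()) % totalAvailable;
                    int dir = 0;
                    while (randomPosition > availableOnDisk[dir]) {
                        randomPosition -= availableOnDisk[dir];
                        dir++;
                    }
                    return dir;
                }

                public static void main(String[] args) {
                    try {
                        pickDir(new long[] {0L, 0L}, new Random(42));
                        System.out.println("no exception");
                    } catch (ArithmeticException e) {
                        System.out.println("ArithmeticException: " + e.getMessage());
                    }
                }
            }
            ```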
            hadoopqa Hadoop QA added a comment -

            -1 overall. Here are the results of testing the latest attachment
            http://issues.apache.org/jira/secure/attachment/12554534/trunk-9079.patch
            against trunk revision .

            +1 @author. The patch does not contain any @author tags.

            -1 tests included. The patch doesn't appear to include any new or modified tests.
            Please justify why no new tests are needed for this patch.
            Also please list what manual steps were performed to verify this patch.

            +1 javac. The applied patch does not increase the total number of javac compiler warnings.

            +1 javadoc. The javadoc tool did not generate any warning messages.

            +1 eclipse:eclipse. The patch built with eclipse:eclipse.

            +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

            +1 release audit. The applied patch does not increase the total number of release audit warnings.

            +1 core tests. The patch passed unit tests in hadoop-common-project/hadoop-common.

            +1 contrib tests. The patch passed contrib unit tests.

            Test results: https://builds.apache.org/job/PreCommit-HADOOP-Build/1790//testReport/
            Console output: https://builds.apache.org/job/PreCommit-HADOOP-Build/1790//console

            This message is automatically generated.

            jxiang Jimmy Xiang added a comment -

            @Eli, could you add me as a Hadoop contributor?

            This could also happen when the local dirs are not writable.

            The second while loop seems to be bounded implicitly, since randomPosition < totalAvailable and the sum of availableOnDisk[dir] over all dirs equals totalAvailable.

            Let me think about it some more.

            Sure, will add a test.

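            [Editor's note] Jimmy's boundedness argument can be checked with a small standalone simulation (illustrative only; it mirrors the loop's shape, not the actual Hadoop code):

            ```java
            import java.util.Random;

            public class BoundedLoopCheck {
                public static void main(String[] args) {
                    Random r = new Random();
                    long[] availableOnDisk = {100L, 0L, 250L, 50L};
                    long totalAvailable = 0;
                    for (long a : availableOnDisk) {
                        totalAvailable += a;
                    }

                    // Invariant: randomPosition < totalAvailable == sum(availableOnDisk),
                    // so the loop must stop at or before the last index.
                    for (int trial = 0; trial < 100_000; trial++) {
                        long randomPosition = Math.abs(r.nextLong()) % totalAvailable;
                        int dir = 0;
                        while (randomPosition > availableOnDisk[dir]) {
                            randomPosition -= availableOnDisk[dir];
                            dir++;
                        }
                        if (dir >= availableOnDisk.length) {
                            throw new AssertionError("dir out of bounds: " + dir);
                        }
                    }
                    System.out.println("dir stayed in bounds across all trials");
                }
            }
            ```

            Note that the invariant only holds while totalAvailable stays equal to the sum of availableOnDisk, which is Ted's point below about numDirs vs dirDF.length.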
            jxiang Jimmy Xiang added a comment -

            Yes, some local dir has to be out of space to reproduce this issue. Is there a good way to simulate a disk running out of space?

            yuzhihong@gmail.com Ted Yu added a comment -
                  int numDirs = localDirs.length;
            ...
                    long[] availableOnDisk = new long[dirDF.length];
            ...
                    while (numDirsSearched < numDirs && returnPath == null) {
            ...
                      if (returnPath == null) {
                        totalAvailable -= availableOnDisk[dir];
                        availableOnDisk[dir] = 0; // skip this disk
                        numDirsSearched++;
                      }
            

            numDirs is derived from localDirs.length, but the size of availableOnDisk is governed by dirDF.length.
            Should the loop condition be the following instead?

                    while (numDirsSearched < dirDF.length && returnPath == null) {
            
            yuzhihong@gmail.com Ted Yu added a comment -

            Patch v2 revises loop condition using dirDF.length

            hadoopqa Hadoop QA added a comment -

            -1 overall. Here are the results of testing the latest attachment
            http://issues.apache.org/jira/secure/attachment/12564802/hadoop-9079-v2.txt
            against trunk revision .

            +1 @author. The patch does not contain any @author tags.

            -1 tests included. The patch doesn't appear to include any new or modified tests.
            Please justify why no new tests are needed for this patch.
            Also please list what manual steps were performed to verify this patch.

            +1 javac. The applied patch does not increase the total number of javac compiler warnings.

            +1 javadoc. The javadoc tool did not generate any warning messages.

            +1 eclipse:eclipse. The patch built with eclipse:eclipse.

            +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

            -1 release audit. The applied patch generated 2 release audit warnings.

            +1 core tests. The patch passed unit tests in hadoop-common-project/hadoop-common.

            +1 contrib tests. The patch passed contrib unit tests.

            Test results: https://builds.apache.org/job/PreCommit-HADOOP-Build/2042//testReport/
            Release audit warnings: https://builds.apache.org/job/PreCommit-HADOOP-Build/2042//artifact/trunk/patchprocess/patchReleaseAuditProblems.txt
            Console output: https://builds.apache.org/job/PreCommit-HADOOP-Build/2042//console

            This message is automatically generated.

            yuzhihong@gmail.com Ted Yu added a comment -

            From https://builds.apache.org/job/PreCommit-HBASE-Build/4979/artifact/trunk/hbase-server/target/surefire-reports/org.apache.hadoop.hbase.mapreduce.TestImportExport-output.txt :

            2013-03-22 22:44:16,077 WARN  [AsyncDispatcher event handler] nodemanager.NMAuditLogger(150): USER=jenkins	OPERATION=Container Finished - Failed	TARGET=ContainerImpl	RESULT=FAILURE	DESCRIPTION=Container failed with state: LOCALIZATION_FAILED	APPID=application_1363992198280_0003	CONTAINERID=container_1363992198280_0003_01_000001
            2013-03-22 22:44:17,065 WARN  [AsyncDispatcher event handler] resourcemanager.RMAuditLogger(255): USER=jenkins	OPERATION=Application Finished - Failed	TARGET=RMAppManager	RESULT=FAILURE	DESCRIPTION=App failed with state: FAILED	PERMISSIONS=Application application_1363992198280_0003 failed 1 times due to AM Container for appattempt_1363992198280_0003_000001 exited with  exitCode: -1000 due to: java.lang.ArithmeticException: / by zero
            	at org.apache.hadoop.fs.LocalDirAllocator$AllocatorPerContext.getLocalPathForWrite(LocalDirAllocator.java:368)
            	at org.apache.hadoop.fs.LocalDirAllocator.getLocalPathForWrite(LocalDirAllocator.java:150)
            	at org.apache.hadoop.fs.LocalDirAllocator.getLocalPathForWrite(LocalDirAllocator.java:131)
            	at org.apache.hadoop.fs.LocalDirAllocator.getLocalPathForWrite(LocalDirAllocator.java:115)
            	at org.apache.hadoop.yarn.server.nodemanager.LocalDirsHandlerService.getLocalPathForWrite(LocalDirsHandlerService.java:285)
            	at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService$LocalizerRunner.run(ResourceLocalizationService.java:846)
            

            hadoop 2.0.4-SNAPSHOT was used in the above test run.

            stack Michael Stack added a comment -

            ted_yu Does the patch attached fix the fail you see above?

            yuzhihong@gmail.com Ted Yu added a comment -

            The patch needs to be integrated so that it can show up in 2.0.4-SNAPSHOT artifacts, right?

            stack Michael Stack added a comment -

            ted_yu You could build a hadoop2 with the patch included. If it fixed the hbase issue, this issue would then be about compatibility and we could raise the priority. The patch needs a test, as per Eli above.

            yuzhihong@gmail.com Ted Yu added a comment -

            I didn't encounter ArithmeticException locally running against hadoop 2.0.4-SNAPSHOT.

            jmhsieh Jonathan Hsieh added a comment -

            Hey guys, I've been hunting down flaky tests in HBase running against MR2/Yarn/Hadoop2. This is one of the root causes. Can we get this into a hadoop2? tedyu@apache.org, jxiang – either of you going to finish this one?

            jmhsieh Jonathan Hsieh added a comment -

            Linked HBASE-8417.

            jxiang Jimmy Xiang added a comment -

            jmhsieh, it is hard to come up with a unit test for this scenario without a big refactoring, which is not good. The fix is reasonable. If we can get it in as-is, that would be great.

            tlipcon Todd Lipcon added a comment -

            Could you make a unit test by making a directory, calling setReadable(false) and setExecutable(false), and then trying to set up a LocalDirAllocator inside it?

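            [Editor's note] A hedged sketch of the test Todd suggests, using only java.io (class and directory names are illustrative; the real test would drive LocalDirAllocator, which is not shown here, and the permission bits may be ignored when running as root):

            ```java
            import java.io.File;
            import java.io.IOException;
            import java.nio.file.Files;

            public class UnreadableDirSketch {
                public static void main(String[] args) throws IOException {
                    File dir = Files.createTempDirectory("alloc-test").toFile();
                    try {
                        // Make the directory unusable the way Todd suggests.
                        dir.setReadable(false);
                        dir.setExecutable(false);

                        // A LocalDirAllocator pointed here would see no usable dir;
                        // listing the directory now typically returns null.
                        File[] children = dir.listFiles();
                        System.out.println("listFiles() -> " + children);
                    } finally {
                        // Restore permissions so the temp dir can be cleaned up.
                        dir.setReadable(true);
                        dir.setExecutable(true);
                        dir.delete();
                    }
                }
            }
            ```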
            jxiang Jimmy Xiang added a comment -

            I can work on it later when I get some time. Please feel free to take it over if anybody wants to help.

            yuzhihong@gmail.com Ted Yu added a comment -

            I made a modification in TestLocalDirAllocator#testRWBufferDirBecomesRO:

            Index: hadoop-common-project/hadoop-common/src/test/java/org/apache/hadoop/fs/TestLocalDirAllocator.java
            ===================================================================
            --- hadoop-common-project/hadoop-common/src/test/java/org/apache/hadoop/fs/TestLocalDirAllocator.java	(revision 1482467)
            +++ hadoop-common-project/hadoop-common/src/test/java/org/apache/hadoop/fs/TestLocalDirAllocator.java	(working copy)
            @@ -215,7 +215,9 @@
                   validateTempDirCreation(buildBufferDir(ROOT, nextDirIdx));
            
                   // change buffer directory 2 to be read only
            -      new File(new Path(dir4).toUri().getPath()).setReadOnly();
            +      File f0 = new File(new Path(dir4).toUri().getPath());
            +      f0.setReadable(false);
            +      f0.setExecutable(false);
                   validateTempDirCreation(dir3);
                   validateTempDirCreation(dir3);
                 } finally {
            

            When I ran the test, I got:

            initializationError(org.apache.hadoop.fs.TestLocalDirAllocator)  Time elapsed: 5 sec  <<< FAILURE!
            java.lang.AssertionError:
              at org.junit.Assert.fail(Assert.java:91)
              at org.junit.Assert.assertTrue(Assert.java:43)
              at org.junit.Assert.assertTrue(Assert.java:54)
              at org.apache.hadoop.fs.TestLocalDirAllocator.rmBufferDirs(TestLocalDirAllocator.java:104)
              at org.apache.hadoop.fs.TestLocalDirAllocator.<clinit>(TestLocalDirAllocator.java:71)
            
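            [Editor's note] The initializationError above occurs in rmBufferDirs, run from the test class's static initializer. A plausible cause is that a buffer dir left non-readable/non-executable by a previous run cannot be traversed and deleted. A hedged sketch of restoring permissions before cleanup (plain java.io; the helper name is illustrative, not from the patch):

            ```java
            import java.io.File;

            public class CleanupSketch {
                // Recursively restore read/execute bits so deletion can traverse the tree.
                static void restoreAndDelete(File f) {
                    f.setReadable(true);
                    f.setExecutable(true);
                    File[] children = f.listFiles();
                    if (children != null) {
                        for (File c : children) {
                            restoreAndDelete(c);
                        }
                    }
                    f.delete();
                }
            }
            ```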
            jxiang Jimmy Xiang added a comment -

            Let me think about it again and come up with a test.


            People

              Assignee: jxiang Jimmy Xiang
              Reporter: jxiang Jimmy Xiang
              Votes: 0
              Watchers: 14