Hadoop Map/Reduce
  1. Hadoop Map/Reduce
  2. MAPREDUCE-5166

ConcurrentModificationException in LocalJobRunner

    Details

    • Type: Bug Bug
    • Status: Closed
    • Priority: Blocker Blocker
    • Resolution: Fixed
    • Affects Version/s: 1.2.0, 2.0.4-alpha
    • Fix Version/s: 1.2.0, 2.1.0-beta
    • Component/s: None
    • Labels:
      None

      Description

      With the latest version hive unit tests fail in various places with the following stack trace. The problem seems related to: MAPREDUCE-2931

          [junit] java.util.ConcurrentModificationException
          [junit] 	at java.util.HashMap$HashIterator.nextEntry(HashMap.java:793)
          [junit] 	at java.util.HashMap$ValueIterator.next(HashMap.java:822)
          [junit] 	at org.apache.hadoop.mapred.Counters.incrAllCounters(Counters.java:505)
          [junit] 	at org.apache.hadoop.mapred.Counters.sum(Counters.java:528)
          [junit] 	at org.apache.hadoop.mapred.LocalJobRunner$Job.getCurrentCounters(LocalJobRunner.java:490)
          [junit] 	at org.apache.hadoop.mapred.LocalJobRunner.getJobCounters(LocalJobRunner.java:634)
          [junit] 	at org.apache.hadoop.mapred.JobClient$NetworkedJob.getCounters(JobClient.java:418)
          [junit] 	at org.apache.hadoop.hive.ql.exec.HadoopJobExecHelper$ExecDriverTaskHandle.getCounters(HadoopJobExecHelper.java:465)
          [junit] 	at org.apache.hadoop.hive.ql.exec.HadoopJobExecHelper.progress(HadoopJobExecHelper.java:300)
          [junit] 	at org.apache.hadoop.hive.ql.exec.HadoopJobExecHelper.progress(HadoopJobExecHelper.java:532)
          [junit] 	at org.apache.hadoop.hive.ql.exec.ExecDriver.execute(ExecDriver.java:453)
          [junit] 	at org.apache.hadoop.hive.ql.exec.ExecDriver.main(ExecDriver.java:681)
          [junit] 	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
          [junit] 	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
          [junit] 	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
          [junit] 	at java.lang.reflect.Method.invoke(Method.java:597)
          [junit] 	at org.apache.hadoop.util.RunJar.main(RunJar.java:160)
      
      1. MAPREDUCE-5166-branch-1.patch
        3 kB
        Sandy Ryza
      2. MAPREDUCE-5166.patch
        3 kB
        Sandy Ryza

        Issue Links

          Activity

          Hide
          Sandy Ryza added a comment -

          Gunther, which version of Hadoop does this appear in?

          Show
          Sandy Ryza added a comment - Gunther, which version of Hadoop does this appear in?
          Hide
          Arun C Murthy added a comment -

          Sandy Ryza This is branch-1.

          Show
          Arun C Murthy added a comment - Sandy Ryza This is branch-1.
          Hide
          Arun C Murthy added a comment -

          Since Hive is badly broken on branch-1, I think the prudent course of action is to revert MAPREDUCE-2931 from branch-1 for now. Thoughts?

          Show
          Arun C Murthy added a comment - Since Hive is badly broken on branch-1, I think the prudent course of action is to revert MAPREDUCE-2931 from branch-1 for now. Thoughts?
          Hide
          Sandy Ryza added a comment -

          Looking into this. The relevant code is also in trunk (from MAPREDUCE-1367), so it's likely that the problem exists there as well.

          Is this occurring in every run? Any steps for how to reproduce?

          Show
          Sandy Ryza added a comment - Looking into this. The relevant code is also in trunk (from MAPREDUCE-1367 ), so it's likely that the problem exists there as well. Is this occurring in every run? Any steps for how to reproduce?
          Hide
          Arun C Murthy added a comment -

          Sandy Ryza Thanks for looking into it. We are seeing tens of Hive tests fail with this.

          Gunther Hagleitner - Can you please let Sandy Ryza know a few? Thanks!

          Show
          Arun C Murthy added a comment - Sandy Ryza Thanks for looking into it. We are seeing tens of Hive tests fail with this. Gunther Hagleitner - Can you please let Sandy Ryza know a few? Thanks!
          Hide
          Tom White added a comment -

          Yes, it would be prudent to revert MAPREDUCE-2931 from branch-1 while this is diagnosed and fixed.

          Looking at the code, I wonder if the problem is with using the static EMPTY_COUNTERS (e.g. in getCurrentCounters()) rather than creating a new empty Counters object each time?

          BTW this shouldn't affect the released version 1.1.2 since MAPREDUCE-2931 went in 1.2.0.

          Show
          Tom White added a comment - Yes, it would be prudent to revert MAPREDUCE-2931 from branch-1 while this is diagnosed and fixed. Looking at the code, I wonder if the problem is with using the static EMPTY_COUNTERS (e.g. in getCurrentCounters()) rather than creating a new empty Counters object each time? BTW this shouldn't affect the released version 1.1.2 since MAPREDUCE-2931 went in 1.2.0.
          Hide
          Arun C Murthy added a comment -

          Actually, getCurrentCounters is badly broken.

          Because it's modifying EMPTY_COUNTERS by summing into it... the same problem exists in trunk.

          Similarly, LJR.initCounters sets up all mapCounters & reduceCounters to be equal to EMPTY_COUNTERS.

          Show
          Arun C Murthy added a comment - Actually, getCurrentCounters is badly broken. Because it's modifying EMPTY_COUNTERS by summing into it... the same problem exists in trunk. Similarly, LJR.initCounters sets up all mapCounters & reduceCounters to be equal to EMPTY_COUNTERS.
          Hide
          Arun C Murthy added a comment -

          We should try and void the "a = b" paradigm, when a & b are of type Counters.

          Show
          Arun C Murthy added a comment - We should try and void the "a = b" paradigm, when a & b are of type Counters.
          Hide
          Arun C Murthy added a comment -

          Furthermore, iterating over Counters isn't thread-safe (copy-safe) since the map-thread might be modifying the Counter, the RPC layer saves us since it makes a safe (sync) copy.

          I wonder what else exhibits this problem in LJR, I mean other than Counters...

          Show
          Arun C Murthy added a comment - Furthermore, iterating over Counters isn't thread-safe (copy-safe) since the map-thread might be modifying the Counter, the RPC layer saves us since it makes a safe (sync) copy. I wonder what else exhibits this problem in LJR, I mean other than Counters...
          Hide
          Arun C Murthy added a comment -

          Thought - how about inventing a deep-copy (Writable.write & Writable.readFields) RPC layer which doesn't use sockets? This seems the safest fix to prevent all such issues for LJR in future...

          Show
          Arun C Murthy added a comment - Thought - how about inventing a deep-copy (Writable.write & Writable.readFields) RPC layer which doesn't use sockets? This seems the safest fix to prevent all such issues for LJR in future...
          Hide
          Arun C Murthy added a comment -

          I had a brief IRC chat with Tom White.

          Rather than revert, I propose we do the following ASAP:

          1. Fix re-use of EMPTY_COUNTERS by using 'new Counters()' everywhere; and implement a RO version of EMPTY_COUNTERS which throws exceptions on modifications (to be really safe).
          2. Fix LJR.statusUpdate to do a deep-copy of Counters using Writable.write & Writable.readFields to simulate the RPC semantics.
          3. Pray that we don't get bitten by this same issue (lack of copy in RPC layer)... smile. Of course, we need to review all implementations of TaskUmbilical/JobSubmissionProtocol methods in LJR to verify that Counters is the only problem.

          Thoughts?

          Show
          Arun C Murthy added a comment - I had a brief IRC chat with Tom White . Rather than revert, I propose we do the following ASAP: Fix re-use of EMPTY_COUNTERS by using 'new Counters()' everywhere; and implement a RO version of EMPTY_COUNTERS which throws exceptions on modifications (to be really safe). Fix LJR.statusUpdate to do a deep-copy of Counters using Writable.write & Writable.readFields to simulate the RPC semantics. Pray that we don't get bitten by this same issue (lack of copy in RPC layer)... smile . Of course, we need to review all implementations of TaskUmbilical/JobSubmissionProtocol methods in LJR to verify that Counters is the only problem. Thoughts?
          Hide
          Arun C Murthy added a comment -

          Sandy Ryza Is there a chance you can take a crack at this ASAP (assuming you are ok with the proposal). The urgency stems from the fact that we need this fixed for 1.2 release (and 2.x of course). Thanks!

          Show
          Arun C Murthy added a comment - Sandy Ryza Is there a chance you can take a crack at this ASAP (assuming you are ok with the proposal). The urgency stems from the fact that we need this fixed for 1.2 release (and 2.x of course). Thanks!
          Hide
          Sandy Ryza added a comment -

          Arun C Murthy, looking at this right now and will have a patch by today. The re-use of EMPTY_COUNTERS was my first hunch when I took a look at this yesterday, but, while I agree that using new Counters() everywhere is better practice, I don't think that's what's causing the problem. EMPTY_COUNTERS is used as an operand in sums, but never summed into. 2. seems more likely to fix it. I'll post a patch that does both of these just to be safe.

          I ran a multi-map local job and repeatedly called getCounters on it to try to reproduce the problem, but couldn't get the exception. Gunther Hagleitner, if I post a patch, would it be easy for you to verify it?

          Show
          Sandy Ryza added a comment - Arun C Murthy , looking at this right now and will have a patch by today. The re-use of EMPTY_COUNTERS was my first hunch when I took a look at this yesterday, but, while I agree that using new Counters() everywhere is better practice, I don't think that's what's causing the problem. EMPTY_COUNTERS is used as an operand in sums, but never summed into. 2. seems more likely to fix it. I'll post a patch that does both of these just to be safe. I ran a multi-map local job and repeatedly called getCounters on it to try to reproduce the problem, but couldn't get the exception. Gunther Hagleitner , if I post a patch, would it be easy for you to verify it?
          Hide
          Sandy Ryza added a comment -

          Uploaded a patch for branch-1

          Show
          Sandy Ryza added a comment - Uploaded a patch for branch-1
          Hide
          Arun C Murthy added a comment -

          Sandy Ryza Thanks for taking this up!

          The deep-copy of TaskStatus might be an overkill since deep-copy of Counters would suffice, but I think it's good to be paranoid.

          Only other point - can we delete EMPTY_COUNTERS since it's not used anymore?

          If you can update the patch, I'll commit right-away to unblock Hive.

          Also, can you please post a similar patch for trunk? Thanks!

          Show
          Arun C Murthy added a comment - Sandy Ryza Thanks for taking this up! The deep-copy of TaskStatus might be an overkill since deep-copy of Counters would suffice, but I think it's good to be paranoid. Only other point - can we delete EMPTY_COUNTERS since it's not used anymore? If you can update the patch, I'll commit right-away to unblock Hive. Also, can you please post a similar patch for trunk? Thanks!
          Hide
          Arun C Murthy added a comment -

          Only other point - can we delete EMPTY_COUNTERS since it's not used anymore?

          Uh, never mind I missed this hunk.

          Show
          Arun C Murthy added a comment - Only other point - can we delete EMPTY_COUNTERS since it's not used anymore? Uh, never mind I missed this hunk.
          Hide
          Arun C Murthy added a comment -

          I'm going to commit the branch-1 patch to unblock Hive testing, while Sandy Ryza updates a patch for trunk.

          Show
          Arun C Murthy added a comment - I'm going to commit the branch-1 patch to unblock Hive testing, while Sandy Ryza updates a patch for trunk.
          Hide
          Arun C Murthy added a comment -

          I just committed to branch-1 & branch-1.2 to unblock Hive testing. I ran TestLocalRunner and worked fine, as expected.

          I'll close this jira once we commit similar fix to trunk. Thanks Sandy!

          Show
          Arun C Murthy added a comment - I just committed to branch-1 & branch-1.2 to unblock Hive testing. I ran TestLocalRunner and worked fine, as expected. I'll close this jira once we commit similar fix to trunk. Thanks Sandy!
          Hide
          Sandy Ryza added a comment -

          Thanks Arun! I'll put a trunk patch up today.

          Show
          Sandy Ryza added a comment - Thanks Arun! I'll put a trunk patch up today.
          Hide
          Sandy Ryza added a comment -

          Uploaded trunk patch

          Show
          Sandy Ryza added a comment - Uploaded trunk patch
          Hide
          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12580418/MAPREDUCE-5166.patch
          against trunk revision .

          +1 @author. The patch does not contain any @author tags.

          -1 tests included. The patch doesn't appear to include any new or modified tests.
          Please justify why no new tests are needed for this patch.
          Also please list what manual steps were performed to verify this patch.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 eclipse:eclipse. The patch built with eclipse:eclipse.

          +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          +1 core tests. The patch passed unit tests in hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-common.

          +1 contrib tests. The patch passed contrib unit tests.

          Test results: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/3545//testReport/
          Console output: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/3545//console

          This message is automatically generated.

          Show
          Hadoop QA added a comment - -1 overall . Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12580418/MAPREDUCE-5166.patch against trunk revision . +1 @author . The patch does not contain any @author tags. -1 tests included . The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. +1 javac . The applied patch does not increase the total number of javac compiler warnings. +1 javadoc . The javadoc tool did not generate any warning messages. +1 eclipse:eclipse . The patch built with eclipse:eclipse. +1 findbugs . The patch does not introduce any new Findbugs (version 1.3.9) warnings. +1 release audit . The applied patch does not increase the total number of release audit warnings. +1 core tests . The patch passed unit tests in hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-common. +1 contrib tests . The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/3545//testReport/ Console output: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/3545//console This message is automatically generated.
          Hide
          Arun C Murthy added a comment -

          I just committed this. Thanks Sandy!

          Show
          Arun C Murthy added a comment - I just committed this. Thanks Sandy!
          Hide
          Hudson added a comment -

          Integrated in Hadoop-trunk-Commit #3659 (See https://builds.apache.org/job/Hadoop-trunk-Commit/3659/)
          MAPREDUCE-5166. Fix ConcurrentModificationException due to insufficient synchronization on updates to task Counters. Contributed by Sandy Ryza. (Revision 1471796)

          Result = SUCCESS
          acmurthy : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1471796
          Files :

          • /hadoop/common/trunk/hadoop-mapreduce-project/CHANGES.txt
          • /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-common/src/main/java/org/apache/hadoop/mapred/LocalJobRunner.java
          Show
          Hudson added a comment - Integrated in Hadoop-trunk-Commit #3659 (See https://builds.apache.org/job/Hadoop-trunk-Commit/3659/ ) MAPREDUCE-5166 . Fix ConcurrentModificationException due to insufficient synchronization on updates to task Counters. Contributed by Sandy Ryza. (Revision 1471796) Result = SUCCESS acmurthy : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1471796 Files : /hadoop/common/trunk/hadoop-mapreduce-project/CHANGES.txt /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-common/src/main/java/org/apache/hadoop/mapred/LocalJobRunner.java
          Hide
          Hudson added a comment -

          Integrated in Hadoop-Yarn-trunk #194 (See https://builds.apache.org/job/Hadoop-Yarn-trunk/194/)
          MAPREDUCE-5166. Fix ConcurrentModificationException due to insufficient synchronization on updates to task Counters. Contributed by Sandy Ryza. (Revision 1471796)

          Result = SUCCESS
          acmurthy : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1471796
          Files :

          • /hadoop/common/trunk/hadoop-mapreduce-project/CHANGES.txt
          • /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-common/src/main/java/org/apache/hadoop/mapred/LocalJobRunner.java
          Show
          Hudson added a comment - Integrated in Hadoop-Yarn-trunk #194 (See https://builds.apache.org/job/Hadoop-Yarn-trunk/194/ ) MAPREDUCE-5166 . Fix ConcurrentModificationException due to insufficient synchronization on updates to task Counters. Contributed by Sandy Ryza. (Revision 1471796) Result = SUCCESS acmurthy : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1471796 Files : /hadoop/common/trunk/hadoop-mapreduce-project/CHANGES.txt /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-common/src/main/java/org/apache/hadoop/mapred/LocalJobRunner.java
          Hide
          Hudson added a comment -

          Integrated in Hadoop-Hdfs-trunk #1383 (See https://builds.apache.org/job/Hadoop-Hdfs-trunk/1383/)
          MAPREDUCE-5166. Fix ConcurrentModificationException due to insufficient synchronization on updates to task Counters. Contributed by Sandy Ryza. (Revision 1471796)

          Result = FAILURE
          acmurthy : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1471796
          Files :

          • /hadoop/common/trunk/hadoop-mapreduce-project/CHANGES.txt
          • /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-common/src/main/java/org/apache/hadoop/mapred/LocalJobRunner.java
          Show
          Hudson added a comment - Integrated in Hadoop-Hdfs-trunk #1383 (See https://builds.apache.org/job/Hadoop-Hdfs-trunk/1383/ ) MAPREDUCE-5166 . Fix ConcurrentModificationException due to insufficient synchronization on updates to task Counters. Contributed by Sandy Ryza. (Revision 1471796) Result = FAILURE acmurthy : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1471796 Files : /hadoop/common/trunk/hadoop-mapreduce-project/CHANGES.txt /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-common/src/main/java/org/apache/hadoop/mapred/LocalJobRunner.java
          Hide
          Hudson added a comment -

          Integrated in Hadoop-Mapreduce-trunk #1410 (See https://builds.apache.org/job/Hadoop-Mapreduce-trunk/1410/)
          MAPREDUCE-5166. Fix ConcurrentModificationException due to insufficient synchronization on updates to task Counters. Contributed by Sandy Ryza. (Revision 1471796)

          Result = SUCCESS
          acmurthy : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1471796
          Files :

          • /hadoop/common/trunk/hadoop-mapreduce-project/CHANGES.txt
          • /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-common/src/main/java/org/apache/hadoop/mapred/LocalJobRunner.java
          Show
          Hudson added a comment - Integrated in Hadoop-Mapreduce-trunk #1410 (See https://builds.apache.org/job/Hadoop-Mapreduce-trunk/1410/ ) MAPREDUCE-5166 . Fix ConcurrentModificationException due to insufficient synchronization on updates to task Counters. Contributed by Sandy Ryza. (Revision 1471796) Result = SUCCESS acmurthy : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1471796 Files : /hadoop/common/trunk/hadoop-mapreduce-project/CHANGES.txt /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-common/src/main/java/org/apache/hadoop/mapred/LocalJobRunner.java
          Hide
          Matt Foley added a comment -

          Closed upon release of Hadoop 1.2.0.

          Show
          Matt Foley added a comment - Closed upon release of Hadoop 1.2.0.

            People

            • Assignee:
              Sandy Ryza
              Reporter:
              Gunther Hagleitner
            • Votes:
              0 Vote for this issue
              Watchers:
              11 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development