Hadoop Map/Reduce
  1. Hadoop Map/Reduce
  2. MAPREDUCE-3969

ConcurrentModificationException in JobHistory.java

    Details

    • Type: Bug Bug
    • Status: Resolved
    • Priority: Major Major
    • Resolution: Not a Problem
    • Affects Version/s: 0.20.203.0
    • Fix Version/s: None
    • Component/s: None
    • Labels:
      None
    • Environment:

      cdh3u1 hadoop distribution, centos 5.5.

      Description

      2012-03-01 04:24:47,479 INFO org.apache.hadoop.mapred.JobInProgress: Task 'attempt_201202150320_3709_m_000148_0' has completed task_201202150320_3709_m_000148 successfully.
      2012-03-01 04:24:47,479 INFO org.apache.hadoop.mapred.JobHistory: Logging failed for job job_201202150320_3709removing PrintWriter from FileManager
      2012-03-01 04:24:47,479 INFO org.apache.hadoop.ipc.Server: IPC Server handler 5 on 8021, call heartbeat(org.apache.hadoop.mapred.TaskTrackerStatus@61069281, false, false, true, -21317) from <TASKTRACKER-IP>:450
      41: error: java.io.IOException: java.util.ConcurrentModificationException
      java.io.IOException: java.util.ConcurrentModificationException
              at java.util.AbstractList$Itr.checkForComodification(AbstractList.java:372)
              at java.util.AbstractList$Itr.next(AbstractList.java:343)
              at org.apache.hadoop.mapred.JobHistory.log(JobHistory.java:591)
              at org.apache.hadoop.mapred.JobHistory$MapAttempt.logFinished(JobHistory.java:1735)
              at org.apache.hadoop.mapred.JobInProgress.completedTask(JobInProgress.java:2515)
              at org.apache.hadoop.mapred.JobInProgress.updateTaskStatus(JobInProgress.java:1200)
              at org.apache.hadoop.mapred.JobTracker.updateTaskStatuses(JobTracker.java:4539)
              at org.apache.hadoop.mapred.JobTracker.processHeartbeat(JobTracker.java:3503)
              at org.apache.hadoop.mapred.JobTracker.heartbeat(JobTracker.java:3202)
              at sun.reflect.GeneratedMethodAccessor2.invoke(Unknown Source)
              at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
              at java.lang.reflect.Method.invoke(Method.java:597)
              at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:557)
              at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1434)
              at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1430)
              at java.security.AccessController.doPrivileged(Native Method)
              at javax.security.auth.Subject.doAs(Subject.java:396)
              at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1127)
              at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1428)
      

      Task task_201202150320_3709_m_000148 was being marked as failed (but had not been restarted) at that moment and job execution was being freezed.

        Activity

        Hide
        Arun C Murthy added a comment -

        Alexey, hadoop-0.20.2 is too old and hadoop-1 or even hadoop-0.20.203+ seem to not have this.

        You might have to check the CDH user-lists.

        Show
        Arun C Murthy added a comment - Alexey, hadoop-0.20.2 is too old and hadoop-1 or even hadoop-0.20.203+ seem to not have this. You might have to check the CDH user-lists.
        Hide
        Harsh J added a comment -

        Arun,

        Did you get a chance to check the issue before closing this? This is very well present in branch-1 too. Reported version was set to 0.20.2 but it was CDH3 per his environment field, which shares much of the same security code, making it worthwhile to at least take a look at.

        Its very rare to hit this, but it is present and it is pretty critical once it hits (and requires a specific TT to restart to resume the tasks, else the job just hangs).

        The fix would be to synchronize the writers list object before entering the loop that can potentially modify the list, inside of JobHistory.log(…):

        + synchronized (writers) {
              for (Iterator<PrintWriter> iter = writers.iterator(); iter.hasNext();) {
                PrintWriter out = iter.next();
                out.println(builder.toString());
                if (out.checkError() && id != null) {
                  LOG.info("Logging failed for job " + id + " removing PrintWriter from FileManager");
                  iter.remove();
                }
              }
        +    }
        

        Its the iter.remove(); conditional call that causes the issue, as it is done without a lock (while others may access the same set of writers in parallel - the fact which makes this doubly very rare to hit).

        Do you agree with the above analysis Arun? If so, I'll post up a patch. Else lemme know what I've missed here.

        Show
        Harsh J added a comment - Arun, Did you get a chance to check the issue before closing this? This is very well present in branch-1 too. Reported version was set to 0.20.2 but it was CDH3 per his environment field, which shares much of the same security code, making it worthwhile to at least take a look at. Its very rare to hit this, but it is present and it is pretty critical once it hits (and requires a specific TT to restart to resume the tasks, else the job just hangs). The fix would be to synchronize the writers list object before entering the loop that can potentially modify the list, inside of JobHistory.log(…): + synchronized (writers) { for (Iterator<PrintWriter> iter = writers.iterator(); iter.hasNext();) { PrintWriter out = iter.next(); out.println(builder.toString()); if (out.checkError() && id != null ) { LOG.info( "Logging failed for job " + id + " removing PrintWriter from FileManager" ); iter.remove(); } } + } Its the iter.remove(); conditional call that causes the issue, as it is done without a lock (while others may access the same set of writers in parallel - the fact which makes this doubly very rare to hit). Do you agree with the above analysis Arun? If so, I'll post up a patch. Else lemme know what I've missed here.
        Hide
        Tom White added a comment -

        I think that the ConcurrentModificationException from the stacktrace in the description was fixed by http://svn.apache.org/viewvc?view=revision&revision=1077739.

        However, it looks like you have found another potential race. Don't you need to lock access for writes to writers too, e.g. in JobHistoryFilesManager#addWriter?

        Show
        Tom White added a comment - I think that the ConcurrentModificationException from the stacktrace in the description was fixed by http://svn.apache.org/viewvc?view=revision&revision=1077739 . However, it looks like you have found another potential race. Don't you need to lock access for writes to writers too, e.g. in JobHistoryFilesManager#addWriter?
        Hide
        Harsh J added a comment -

        Ah it indeed was fixed, for this specific issue. Thanks for catching that!

        I'll take a look at the other spots you mention and come up with another patch.

        Show
        Harsh J added a comment - Ah it indeed was fixed, for this specific issue. Thanks for catching that! I'll take a look at the other spots you mention and come up with another patch.
        Hide
        Harsh J added a comment -

        It appears that such a scenario can't arise. Since I've not seen this issue again after applying the undocumented fixes to my local branch, I'm re-resolving this.

        Thanks Arun and Tom!

        Show
        Harsh J added a comment - It appears that such a scenario can't arise. Since I've not seen this issue again after applying the undocumented fixes to my local branch, I'm re-resolving this. Thanks Arun and Tom!

          People

          • Assignee:
            Harsh J
            Reporter:
            Alexey Zotov
          • Votes:
            2 Vote for this issue
            Watchers:
            5 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development